GENERAL MAXIMUM LIKELIHOOD EMPIRICAL BAYES 
ESTIMATION OF NORMAL MEANS 

By Wenhua Jiang and Cun-Hui Zhang

Rutgers University 

We propose a general maximum likelihood empirical Bayes (GM- 
LEB) method for the estimation of a mean vector based on obser- 
vations with i.i.d. normal errors. We prove that under mild moment 
conditions on the unknown means, the average mean squared er- 
ror (MSE) of the GMLEB is within an infinitesimal fraction of the 
minimum average MSE among all separable estimators which use a 
single deterministic estimating function on individual observations, 
provided that the risk is of greater order than $(\log n)^5/n$. We also
prove that the GMLEB is uniformly approximately minimax in regular
and weak $\ell_p$ balls when the order of the length-normalized norm
of the unknown means is between $(\log n)^{\kappa_1}/n^{1/(p\wedge 2)}$ and $n/(\log n)^{\kappa_2}$.
Simulation experiments demonstrate that the GMLEB outperforms 
the James-Stein and several state-of-the-art threshold estimators in 
a wide range of settings without much downside. 

1. Introduction. This paper concerns the estimation of a vector with 
i.i.d. normal errors under the average squared loss. The problem, known 
as the compound estimation of normal means, has been considered as the 
canonical model or motivating example in the developments of empirical 
Bayes, admissibility, adaptive nonparametric regression, variable selection, 
multiple testing and many other areas in statistics. It also carries significant 
practical relevance in statistical applications since the observed data are 
often understood, represented or summarized as the sum of a signal vector 
and the white noise. 

There are three main approaches in the compound estimation of normal 
means. The first one is general empirical Bayes (EB) [27, 30], which assumes 
essentially no knowledge about the unknown means but still aims to attain 
the performance of the oracle separable estimator based on the knowledge 
of the empirical distribution of the unknowns. Here a separable estimator is 
one that uses a fixed deterministic function of the $i$th observation to estimate 
the $i$th mean. This greedy approach, also called nonparametric EB [26], was 
proposed the earliest among the three, but it is also the least understood, in 
spite of [28, 29, 30, 36, 37, 38]. Efron [15] attributed this situation to the lack 
of applications with many unknowns before the information era and pointed 
out that "current scientific trends favor a greatly increased role for empir- 
ical Bayes methods" due to the prevalence of large, high-dimensional data 
and the rapid rise of computing power. The methodological and theoretical 
challenge, which we focus on in this paper, is to find the "best" general EB 
estimators and sort out the type and size of problems suitable for them. 

The second approach, conceived with the celebrated Stein's proof of the 
inadmissibility of the optimal unbiased estimator and the introduction of 
the James-Stein estimator [22, 31], is best understood through its paramet- 
ric or linear EB interpretations [16, 17, 26]. The James-Stein estimator is 
minimax over the entire space of the unknown mean vector and well ap- 
proximates the optimal linear separable estimator based on the oracular 
knowledge of the first two empirical moments of the unknown means. Thus, 
it achieves the general EB optimality when the empirical distribution of the 
unknown means is approximately normal. However, by design, the James-Stein
estimator does not perform well compared with the general EB when 
the minimum risk of linear separable estimators is far different from that of 
all separable estimators [36]. Still, what is the cost of being greedy with the 
general EB when the empirical distribution of the unknown means is indeed 
approximately normal? 

The third approach focuses on unknown mean vectors which are sparse 
in the sense of having many (near) zeros. Such sparse vectors can be treated 
as members of small $\ell_p$ balls with $p < 2$. Examples include the estimation of 
functions with unknown discontinuity or inhomogeneous smoothness across 
different parts of a domain in nonparametric regression or density problems 
[13]. For sparse means, the James-Stein or the oracle linear estimators could 
perform much worse than threshold estimators [12]. Many threshold methods
have been proposed and proved to possess (near) optimality properties 
for sparse signals, including the universal [13], SURE [14], FDR [1, 2], the 
generalized $C_p$ [3] and the parametric EB posterior median (EBThresh) [24]. 
These estimators can be viewed as approximations of the optimal candidate 
in certain families of separable threshold estimators, so that they do not 
perform well by design compared with the general EB when the minimum 
risk of separable threshold estimators is far different from that of all separa- 
ble estimators [38]. Again, what is the cost of being greedy with the general 
EB when the unknown means are indeed very sparse? 
Since general EB methods have to spend more "degrees of freedom" for 
nonparametric estimation of their oracle rule, compared with linear and
threshold methods, the heart of the question is whether the gain by aiming at the 
smaller general EB benchmark risk is large enough to offset the additional 
cost of the nonparametric estimation. 

We propose a general maximum likelihood EB (GMLEB) in which we first 
estimate the empirical distribution of the unknown means by the generalized 
maximum likelihood estimator (MLE) [25] and then plug the estimator into 
the oracle general EB rule. In other words, we treat the unknown means as 
i.i.d. variables with a completely unknown common "prior" distribution (for 
the purpose of deriving the GMLEB, regardless of whether the unknowns are
actually deterministic or random), estimate the nominal prior with the generalized 
MLE, and then use the Bayes rule for the estimated prior. The basic idea 
was discussed in the last paragraph of [27] as a general way of deriving 
solutions to compound decision problems, although the notion of MLE was 
vague at that time without a parametric model and not much has been done 
since then about using the generalized MLE to estimate the nominal prior 
in compound estimation. 

Our results affirm that by aiming at the minimum risk of all separable 
estimators, the greedier general EB approach realizes significant risk reduc- 
tion over linear and threshold methods for a wide range of the unknown 
signal vectors for moderate and large samples, and this is especially so for 
the GMLEB. We prove that the risk of the GMLEB estimator is within an 
infinitesimal fraction of the general EB benchmark when the risk is of the 
order $n^{-1}(\log n)^5$ or greater, depending on the magnitude of the weak $\ell_p$ 
norm of the unknown means, $0 < p < \infty$. Such adaptive ratio optimality is 
obtained through a general oracle inequality which also implies the adaptive 
minimaxity of the GMLEB over a broad collection of regular and weak $\ell_p$ 
balls. This adaptive minimaxity result unifies and improves upon the adap- 
tive minimaxity of threshold estimators for sparse means [1, 14, 24] and the 
Fourier general EB estimators for moderately sparse and dense means [38]. 
We demonstrate the superb risk performance of the GMLEB for moderate 
samples through simulation experiments, and describe algorithms to show 
its computational feasibility. 

The paper is organized as follows. In Section 2, we highlight our results 
and formally introduce certain necessary terminologies and concepts. In Sec- 
tion 3 we provide upper bounds for the regret of a regularized Bayes rule us- 
ing a predetermined and possibly misspecified prior. In Section 4 we prove an 
oracle inequality for the GMLEB, compared with the general EB benchmark 
risk. The consequences of this oracle inequality, including statements of our 
adaptive ratio optimality and adaptive minimaxity results in full strength, 
are also discussed in Section 4. In Section 5 we present more simulation re- 
sults. Section 6 contains some discussion. Mathematical proofs of theorems, 
propositions and lemmas are given either right after their statements or in 
the Appendix. 

2. Problem formulation and highlight of main results. Let $X_i$ be
independent statistics with 

$$X_i \sim \varphi(x - \theta_i) \sim N(\theta_i, 1), \quad i = 1, \ldots, n, \tag{2.1}$$

under a probability measure $P_{n,\boldsymbol{\theta}}$, where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$ is an unknown 
signal vector. Our problem is to estimate $\boldsymbol{\theta}$ under the compound loss 

$$L_n(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta}) = n^{-1}\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2 = \frac{1}{n}\sum_{i=1}^n (\hat{\theta}_i - \theta_i)^2 \tag{2.2}$$

for any given estimator $\hat{\boldsymbol{\theta}} = (\hat{\theta}_1, \ldots, \hat{\theta}_n)$. Throughout this paper, the
unknown means $\theta_i$ are assumed to be deterministic as in the standard
compound decision theory [27]. To avoid confusion, the Greek $\theta$ is used only 
with boldface as a deterministic mean vector $\boldsymbol{\theta}$ in $\mathbb{R}^n$ or with subscripts 
as elements of $\boldsymbol{\theta}$. A random mean is denoted by $\xi$ as in (2.3) below. The 
estimation of i.i.d. random means is discussed in Section 6.3. 

We divide the section into seven subsections to describe (1) the general 
and restricted EB, (2) the GMLEB method, (3) the computation of the 
GMLEB, (4) some simulation results, (5) the adaptive ratio optimality of 
the GMLEB, (6) the adaptive minimaxity of the GMLEB in $\ell_p$ balls and 
(7) minimax theory in $\ell_p$ balls. 

Throughout the paper, boldface letters denote vectors and matrices, for 
example, $\boldsymbol{X} = (X_1, \ldots, X_n)$, $\varphi(x) = e^{-x^2/2}/\sqrt{2\pi}$ denotes the standard
normal density, $L(y) = \sqrt{-\log(2\pi y^2)}$ denotes the inverse of $y = \varphi(x)$ for
positive $x$ and $y$, $x \vee y = \max(x, y)$, $x \wedge y = \min(x, y)$, $x_+ = x \vee 0$ and $a_n \asymp b_n$ 
means $0 < a_n/b_n + b_n/a_n = O(1)$. In a number of instances, $\log(x)$ should be 
viewed as $\log(x \vee e)$. Univariate functions are applied to vectors per
component. Thus, an estimator of $\boldsymbol{\theta}$ is separable if it is of the form $\hat{\boldsymbol{\theta}} = t(\boldsymbol{X}) = 
(t(X_1), \ldots, t(X_n))$ with a predetermined Borel function $t(\cdot)$. In the vector 
notation, it is convenient to state (2.1) as $\boldsymbol{X} \sim N(\boldsymbol{\theta}, \boldsymbol{I}_n)$ with $\boldsymbol{I}_n$ being the 
identity matrix in $\mathbb{R}^n$. 

2.1. Empirical Bayes. The compound estimation of a vector of
deterministic normal means is closely related to the Bayes estimation of a single 
random mean. In this Bayes problem, we estimate a univariate random
parameter $\xi$ based on a univariate $Y$ such that 

$$Y \mid \xi \sim N(\xi, 1), \quad \xi \sim G, \quad \text{under } P_G. \tag{2.3}$$
The prior distribution $G = G_n$ which naturally matches the unknown means 
$\{\theta_i, i \le n\}$ in (2.1) is the empirical distribution 

$$G_n(u) = G_{n,\boldsymbol{\theta}}(u) = \frac{1}{n}\sum_{i=1}^n I\{\theta_i \le u\}. \tag{2.4}$$

Here and in the sequel, subscripts $n, \boldsymbol{\theta}$ indicate dependence of distribution or 
probability upon $n$ and the unknown deterministic vector $\boldsymbol{\theta}$. 

The fundamental theorem of compound decisions [27] in the context of the 
$\ell_2$ loss asserts that the compound risk of a separable rule $\hat{\boldsymbol{\theta}} = t(\boldsymbol{X})$ under 
the probability $P_{n,\boldsymbol{\theta}}$ in the multivariate model (2.1) is identical to the MSE 
of the same rule $\hat{\xi} = t(Y)$ under the prior (2.4) in the univariate model (2.3): 

$$E_{n,\boldsymbol{\theta}} L_n(t(\boldsymbol{X}), \boldsymbol{\theta}) = E_{G_n}(t(Y) - \xi)^2. \tag{2.5}$$

For any true or nominal priors $G$, denote the Bayes rule as 

$$t^*_G = \arg\min_t E_G(t(Y) - \xi)^2 = \frac{\int u\,\varphi(Y - u)\,G(du)}{\int \varphi(Y - u)\,G(du)} \tag{2.6}$$

and the minimum Bayes risk as 

$$R^*(G) = E_G(t^*_G(Y) - \xi)^2, \tag{2.7}$$

where the minimum is taken over all Borel functions. It follows from (2.5) 
that among all separable rules, the compound risk is minimized by the Bayes 
rule with prior (2.4), resulting in the general EB benchmark 

$$R^*(G_n) = E_{n,\boldsymbol{\theta}} L_n(t^*_{G_n}(\boldsymbol{X}), \boldsymbol{\theta}) = \min_{t(\cdot)} E_{n,\boldsymbol{\theta}} L_n(t(\boldsymbol{X}), \boldsymbol{\theta}). \tag{2.8}$$

The general EB approach seeks procedures which approximate the Bayes 
rule $t^*_{G_n}(\boldsymbol{X})$ or approximately achieve the risk benchmark $R^*(G_n)$ in (2.8). 
Given a class of functions $\mathcal{D}$, the aim of the restricted EB is to attain 

$$R_{\mathcal{D}}(G_n) = \inf_{t \in \mathcal{D}} E_{n,\boldsymbol{\theta}} L_n(t(\boldsymbol{X}), \boldsymbol{\theta}) = \inf_{t \in \mathcal{D}} E_{G_n}(t(Y) - \xi)^2, \tag{2.9}$$

approximately. This provides EB interpretations for all the adaptive methods
discussed in the Introduction, with $\mathcal{D}$ being the classes of all linear 
functions for the James-Stein estimator, all soft threshold functions for the 
SURE [14], and all hard threshold functions for the generalized $C_p$ [3] or the 
FDR [1]. For the EBThresh [24], $\mathcal{D}$ is the class of all posterior median
functions $t(y) = \mathrm{median}(\xi \mid Y = y)$ under the probability $P_G$ in (2.3) for priors of 
the form 

$$G(u) = \omega_0 I\{0 \le u\} + (1 - \omega_0) G_0(u/\tau), \tag{2.10}$$

where $\omega_0$ and $\tau$ are free and $G_0$ is given [e.g., $dG_0(u)/du = e^{-|u|}/2$]. 
Compared with linear and threshold methods, the general EB approach 
is greedier since it aims at the smaller benchmark risk: $R^*(G_n) \le R_{\mathcal{D}}(G_n)$ 
for all $\mathcal{D}$. This could still backfire when the regret 

$$r_{n,\boldsymbol{\theta}}(\hat{t}_n) = E_{n,\boldsymbol{\theta}} L_n(\hat{t}_n(\boldsymbol{X}), \boldsymbol{\theta}) - R^*(G_n) \tag{2.11}$$

of using an estimator $\hat{t}_n(\cdot)$ of the general EB oracle rule $t^*_{G_n}(\cdot)$ is greater 
than the difference $R_{\mathcal{D}}(G_n) - R^*(G_n)$ in benchmarks, but our simulations 
and oracle inequalities show that $r_{n,\boldsymbol{\theta}}(\hat{t}_n) = o(1)R^*(G_n)$ uniformly for a 
wide range of the unknown vector $\boldsymbol{\theta}$ and moderate/large samples. 
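
The gap between the general EB benchmark and a restricted benchmark can be checked numerically for a discrete prior. The sketch below is illustrative only: the function names, the Riemann-sum integration and its `pad` and `step` settings are assumptions introduced here, not part of the paper. It evaluates the minimum Bayes risk (2.7) by integrating the squared error of the Bayes rule (2.6), and compares it with the minimum risk over linear rules $t(y) = a + by$, which equals $\mathrm{Var}(G)/(1 + \mathrm{Var}(G))$.

```python
import numpy as np

def normal_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def general_eb_benchmark(atoms, weights, pad=10.0, step=0.01):
    """R*(G) in (2.7) for a discrete prior G = sum_j w_j * delta_{u_j}.

    Integrates (t*_G(y) - u_j)^2 against w_j * phi(y - u_j) with a Riemann
    sum; `pad` and `step` are illustrative numerical settings.
    """
    u = np.asarray(atoms, dtype=float)
    w = np.asarray(weights, dtype=float)
    y = np.arange(u.min() - pad, u.max() + pad, step)
    joint = w[None, :] * normal_pdf(y[:, None] - u[None, :])   # w_j * phi(y - u_j)
    bayes_rule = (joint * u[None, :]).sum(axis=1) / joint.sum(axis=1)  # (2.6)
    sq_err = (bayes_rule[:, None] - u[None, :]) ** 2
    return float((sq_err * joint).sum() * step)

def linear_benchmark(atoms, weights):
    """Minimum risk over linear rules t(y) = a + b*y: Var(G) / (1 + Var(G))."""
    u = np.asarray(atoms, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = np.sum(w * u)
    var = np.sum(w * (u - mean) ** 2)
    return float(var / (1.0 + var))

# A binary prior: 5% of the means equal 5, the rest are 0.
atoms, weights = [0.0, 5.0], [0.95, 0.05]
r_star = general_eb_benchmark(atoms, weights)
r_lin = linear_benchmark(atoms, weights)
print(f"general EB benchmark: {r_star:.4f}, linear benchmark: {r_lin:.4f}")
```

For this sparse binary prior the linear benchmark is far above the general EB benchmark, which is the situation in which aiming at $R^*(G_n)$ has the most to gain.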

Zhang [36] proposed a general EB method based on a Fourier infinite- 
order smoothing kernel. The Fourier general EB estimator is asymptotically 
minimax over the entire parameter space and approximately reaches the gen- 
eral EB benchmark (2.8) uniformly for dense and moderately sparse signals, 
provided that the oracle Bayes risk is of the order $n^{-1/2}(\log n)^{3/2}$ or greater 
[36]. Hybrid general EB estimators have been developed [38] to combine the 
features and optimality properties of the Fourier general EB and thresh- 
old estimators. Still, the performance of general EB methods is sometimes 
perceived as uncertain in moderate samples [24]. Indeed, the Fourier general 
EB requires selection of certain tuning parameters and its proven theoretical 
properties are not completely satisfying. This motivates our investigation. 

2.2. The GMLEB. The GMLEB method replaces the unknown prior $G_n$ 
of the oracle rule $t^*_{G_n}$ by its generalized MLE [25] 

$$\hat{G}_n = \hat{G}_n(\cdot\,; \boldsymbol{X}) = \arg\max_{G \in \mathcal{G}} \prod_{i=1}^n f_G(X_i), \tag{2.12}$$

where $\mathcal{G}$ is the family of all distribution functions and $f_G$ is the density 

$$f_G(x) = \int \varphi(x - u)\, G(du) \tag{2.13}$$

of the normal location mixture by distribution $G$. 

The estimator (2.12) is called the generalized MLE since the likelihood is 
used only as a vehicle to generate the estimator. The G here is used only 
as a nominal prior. In our adaptive ratio and minimax optimality theorems 
and oracle inequality, the GMLEB is evaluated under the measures $P_{n,\boldsymbol{\theta}}$ in 
(2.1) where the unknowns $\theta_i$ are assumed to be deterministic parameters. 

Since (2.12) is typically solved by iterative algorithms, we allow
approximate solutions to be used. For definiteness and notational simplicity, the 
generalized MLE in the sequel is any solution of 

$$\hat{G}_n \in \mathcal{G}, \quad \prod_{i=1}^n f_{\hat{G}_n}(X_i) \ge q_n \sup_{G \in \mathcal{G}} \prod_{i=1}^n f_G(X_i) \tag{2.14}$$

with $q_n = (e\sqrt{2\pi}/n^2) \wedge 1$, although the theoretical results in this paper all 
hold verbatim for less stringent (2.14) with $0 \le \log(1/q_n) \le c_0(\log n)$ for any 
fixed constant $c_0$. Formally, the GMLEB estimator is defined as 

$$\hat{\boldsymbol{\theta}} = t^*_{\hat{G}_n}(\boldsymbol{X}) \quad \text{or equivalently} \quad \hat{\theta}_i = t^*_{\hat{G}_n}(X_i), \quad i = 1, \ldots, n, \tag{2.15}$$

where $t^*_G$ is the Bayes rule in (2.6) and $\hat{G}_n$ is any approximate generalized 
MLE (2.14) for the nominal prior (2.4). Clearly the GMLEB estimator 
(2.15) is completely nonparametric and does not require any restriction, 
regularization, bandwidth selection or other forms of tuning. 
The GMLEB is location equivariant in the sense that 

$$t^*_{\hat{G}_n(\cdot\,; \boldsymbol{X} + c\boldsymbol{e})}(\boldsymbol{X} + c\boldsymbol{e}) = t^*_{\hat{G}_n(\cdot\,; \boldsymbol{X})}(\boldsymbol{X}) + c\boldsymbol{e} \tag{2.16}$$

for all real $c$, where $\boldsymbol{e} = (1, \ldots, 1) \in \mathbb{R}^n$. This is due to the location
equivariance of the generalized MLE: $\hat{G}_n(x; \boldsymbol{X} + c\boldsymbol{e}) = \hat{G}_n(x - c; \boldsymbol{X})$. Compared 
with the Fourier general EB estimators [36, 38], the GMLEB (2.15) is more 
appealing since the function $t^*_{\hat{G}_n}(x)$ of $x$ enjoys all analytical properties of 
Bayes rules: monotonicity, infinite differentiability and more. However, the 
GMLEB is much harder to analyze than the Fourier general EB. We first 
address the computational issues in the next section. 

2.3. Computation of the GMLEB. It follows from Carathéodory's 
theorem [9] that there exists a discrete solution of (2.12) with no more than 
$n + 1$ support points. A discrete approximate generalized MLE $\hat{G}_n$ with $m$ 
support points can be written as 

$$\hat{G}_n = \sum_{j=1}^m \hat{w}_j \delta_{u_j}, \quad \hat{w}_j \ge 0, \quad \sum_{j=1}^m \hat{w}_j = 1, \tag{2.17}$$

where $\delta_u$ is the probability distribution giving its entire mass to $u$. Given 
(2.17), the GMLEB estimator can be easily computed as 

$$\hat{\theta}_i = t^*_{\hat{G}_n}(X_i) = \frac{\sum_{j=1}^m u_j \varphi(X_i - u_j)\hat{w}_j}{\sum_{j=1}^m \varphi(X_i - u_j)\hat{w}_j}, \tag{2.18}$$

since $t^*_G(x)$ is the conditional expectation as in (2.6). 
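
For a discrete prior, the conditional expectation in (2.6) reduces to a weighted average of the support points, so (2.18) takes only a few lines to compute. The following is a minimal sketch under that reading; the function name `gmleb_estimate` and the log-scale stabilisation are choices made here for illustration and are not prescribed by the paper.

```python
import numpy as np

def gmleb_estimate(x, atoms, weights):
    """Posterior mean E[xi | Y = x_i] under the discrete prior sum_j w_j * delta_{u_j}."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(atoms, dtype=float)
    w = np.asarray(weights, dtype=float)
    # log(w_j * phi(x_i - u_j)) up to an additive constant that cancels in the ratio;
    # zero weights give -inf here and hence contribute exactly 0 after exp().
    with np.errstate(divide="ignore"):
        log_kern = np.log(w)[None, :] - 0.5 * (x[:, None] - u[None, :]) ** 2
    log_kern -= log_kern.max(axis=1, keepdims=True)   # avoid underflow for distant atoms
    kern = np.exp(log_kern)
    return (kern * u[None, :]).sum(axis=1) / kern.sum(axis=1)

# Symmetric two-point prior on {-1, +1}: the posterior mean is tanh(x).
x_demo = np.array([-2.0, 0.0, 0.5, 3.0])
est = gmleb_estimate(x_demo, atoms=[-1.0, 1.0], weights=[0.5, 0.5])
```

The two-point example is a convenient sanity check because its Bayes rule has the closed form $\tanh(x)$, which is monotone and smooth as Bayes rules in this model must be.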

Since the generalized MLE $\hat{G}_n$ is completely nonparametric, the support 
points $\{u_j, j \le m\}$ and weights $\{\hat{w}_j, j \le m\}$ in (2.17) are selected or
computed solely to maximize the likelihood in (2.12). There are quite a few 
possible algorithms for solving (2.14), but all depend on iterative
approximations. Due to the monotonicity of $\varphi(t)$ in $t^2$, the generalized MLE (2.12) 
puts all its mass in the interval $I_0 = [\min_{1 \le i \le n} X_i, \max_{1 \le i \le n} X_i]$. Given a 
fine grid $\{u_j\}$ in $I_0$, the EM-algorithm [11, 35] 

$$\hat{w}_j^{(k)} = \frac{1}{n}\sum_{i=1}^n \frac{\hat{w}_j^{(k-1)} \varphi(X_i - u_j)}{\sum_{\ell=1}^m \hat{w}_\ell^{(k-1)} \varphi(X_i - u_\ell)} \tag{2.19}$$

optimizes the weights $\{\hat{w}_j\}$. In Section 6.2, we provide a conservative
statistical criterion on $\{u_j\}$ and an EM-stopping rule to guarantee (2.14). 
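
A fixed-grid EM iteration of this kind can be sketched as follows. The update used is the standard EM step for mixture weights with fixed support points, $w_j \leftarrow n^{-1}\sum_i w_j\varphi(X_i - u_j)/\sum_\ell w_\ell\varphi(X_i - u_\ell)$, which is assumed here to be what the paper's EM step does. The function name `npmle_em`, the equally spaced demo grid and the simulated data are illustrative assumptions only.

```python
import numpy as np

def npmle_em(x, grid, n_iter=100, init=None):
    """Run EM over the weights of a mixing distribution supported on `grid`.

    Returns the final weights and the log-likelihood before each update.
    Assumes every x_i lies within range of the grid, so no mixture density
    value underflows to zero.
    """
    x = np.asarray(x, dtype=float)
    u = np.asarray(grid, dtype=float)
    m = u.size
    w = np.full(m, 1.0 / m) if init is None else np.asarray(init, dtype=float).copy()
    # phi[i, j] = phi(x_i - u_j); fixed throughout because the grid is fixed
    phi = np.exp(-0.5 * (x[:, None] - u[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    loglik = []
    for _ in range(n_iter):
        mix = phi @ w                                # f_G(x_i) under the current weights
        loglik.append(np.sum(np.log(mix)))
        w = w * (phi / mix[:, None]).mean(axis=0)    # EM update of each weight
    return w, np.array(loglik)

# Illustrative data: 180 zero means and 20 means equal to 4, unit normal noise.
rng = np.random.default_rng(0)
theta = np.concatenate([np.zeros(180), np.full(20, 4.0)])
x = theta + rng.standard_normal(theta.size)
grid = np.linspace(x.min(), x.max(), 200)
w_hat, loglik = npmle_em(x, grid, n_iter=100)
```

Each update keeps the weights nonnegative and summing to one, and the log-likelihood never decreases, so a fixed iteration count or a likelihood-based stopping rule can both be used.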

We took a simple approach in our simulation experiments. Given $\{X_i, 1 \le i \le n\}$
and with $X_0 = 0$, we chose the grid points $\{u_j\}$ as a set of multiples 
of $\epsilon = \max_{0 \le i < j \le n} |X_i - X_j|/999$ with $u_j = u_{j-1} + \epsilon$ and the range 

$$-j_0\epsilon = u_1 - \epsilon < \min_{0 \le i \le n} X_i \le u_1, \qquad u_m = (m - j_0)\epsilon \le \max_{0 \le i \le n} X_i < u_m + \epsilon$$

with an integer $j_0 \in [1, m]$. This ensures $u_{j_0} = 0$ as a grid point and
$999 \le m \le 1000$. We ran 100 EM-iterations (2.19) in our simulations. We have 
tried to optimize both the support points {uj} and weights {wj} in the 
EM-algorithm, but gained limited improvements. 
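
One way to read the grid construction above is that the grid consists of all integer multiples of the spacing that fall between the smallest and largest of the observations and 0. The sketch below follows that reading; the function name `multiples_grid` and the handling of the end points are assumptions made here, since the exact inequalities are not fully legible in the source.

```python
import math
import numpy as np

def multiples_grid(x, n_cells=999):
    """Grid of integer multiples of eps covering the range of {0, x_1, ..., x_n}.

    eps is the range divided by `n_cells`; 0 is always a grid point.
    Assumes the observations are not all exactly 0 (otherwise eps would be 0).
    """
    x = np.asarray(x, dtype=float)
    lo = min(float(x.min()), 0.0)    # X_0 = 0 is included in the range
    hi = max(float(x.max()), 0.0)
    eps = (hi - lo) / n_cells
    k_lo = math.ceil(lo / eps)       # smallest multiple of eps that is >= lo
    k_hi = math.floor(hi / eps)      # largest multiple of eps that is <= hi
    return eps * np.arange(k_lo, k_hi + 1)

grid_demo = multiples_grid([-1.3, 0.4, 2.7])
```

With 999 cells this yields 999 or 1000 grid points in exact arithmetic, in line with the count quoted in the text; floating-point rounding at the end points could in rare cases change the count by one.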

The GMLEB estimator (2.18) depends slightly on the initialization of the 
EM-algorithm due to the nonuniqueness of the GMLEB estimator and the 
fixed number of EM-iterations in our implementation. Since the generalized 
MLE (2.12) is unique only up to the values of $\{f_{\hat{G}_n}(X_i), i \le n\}$, different 
EM-initializations lead to different versions of $\hat{G}_n$, which then result in
different values of $t^*_{\hat{G}_n}(X_i)$ in (2.18). This nonuniqueness persists even when 
we run infinitely many EM-iterations. Nevertheless, our theoretical results 
hold for all versions of the GMLEB. 

We consider two options in our simulation experiments. The first option 
initializes the weights with the uniform distribution $w_j = 1/m$. The
second option takes into consideration the possible sparsity of the signal by 
putting a good starting mass at $u_{j_0} = 0$: 

$$w_{j_0} = \hat{\omega}_0, \qquad w_j = \frac{1 - \hat{\omega}_0}{m - 1}, \quad j \ne j_0. \tag{2.20}$$

We estimate the proportion of zeros within the $n$ means by a Fourier method, 

1 n f 2/ 

Ti . J 

3=1 

as in [32, 33], where $\psi_0$ is a density function with support $[-1, 1]$ and
$h_n = \{\kappa(\log n)\}^{-1/2}$ is the bandwidth, $\kappa < 1$. In our simulation experiments, the 
uniform $[-1, 1]$ density is used as $\psi_0$ and $\kappa = 1/2$. To distinguish the two 
options of initializing the EM-algorithm, we reserve the name GMLEB for 
the uniform initialization and call (sparse-) S-GMLEB the estimator with 
the initialization (2.20) when we report simulation results. 
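
The sparse initialization can be sketched as a starting weight vector that places an estimated proportion of zeros on the grid point at 0 and spreads the remaining mass evenly over the other grid points. In the sketch below the function name `sparse_init` is hypothetical, and the argument `w0_hat` stands for whatever estimate of the proportion of zeros is available; the Fourier estimator itself is not reproduced here.

```python
import numpy as np

def sparse_init(grid, w0_hat):
    """Initial EM weights: mass w0_hat at the grid point 0, the rest spread evenly."""
    u = np.asarray(grid, dtype=float)
    j0 = int(np.argmin(np.abs(u)))                    # index of the grid point at (or nearest) 0
    w = np.full(u.size, (1.0 - w0_hat) / (u.size - 1))
    w[j0] = w0_hat
    return w

w_init = sparse_init(np.array([-1.0, 0.0, 1.0, 2.0]), w0_hat=0.7)
```

The resulting vector can be passed as the starting point of the EM iteration in place of the uniform weights.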

2.4. Some simulation results. Johnstone and Silverman [24] reported re- 
sults of an extensive simulation study of 18 threshold estimators, including 
eight options of their EBThresh, the SURE and adaptive SURE [14], the 
FDR [1] at three levels, three block threshold methods [7, 8] and the soft and 
hard threshold at the universal threshold level $\sqrt{2\log n}$. In their simulations, 
the overall best performer is the EBThresh using the posterior median for 
the prior (2.10) with the double exponential $dG_0(u)/du = e^{-|u|}/2$ and the 
MLE of $(\omega_0, \tau)$. 

Table 1 

Average total squared errors $\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2$ for $n = 1000$ unknown means in various 
binary models where $\theta_i$ is either 0 or $\mu$ with the number of nonzero $\theta_i = \mu$ being 5, 50 or 
500. The "Best" stands for the best simulation results in Table 1 of Johnstone and 
Silverman [24]. Each entry is based on 100 replications 

# nonzero   |          5            |           50            |           500
$\mu$       |   3    4    5    7    |   3     4     5     7   |    3     4     5     7
------------+-----------------------+-------------------------+------------------------
James-Stein |  45   76  113  199    |  312   442   556   716  |   822   889   933   954
EBThresh    |  37   34   20    8    |  212   151   103    74  |   862   873   792   653
SURE        |  42   64  7:5   75    |  416   609   215   214  |   835   834   842   828
FDR (0.01)  |  43   54   29    6    |  388   299   132    57  |  2587  1322   667   520
FDR (0.1)   |  42   38   21   13    |  278   163   115    99  |  1162   744   662   640
GMLEB       |  39   34   23   11    |  157   105    58    14  |   459   285   139    18
S-GMLEB     |  32   28   17    6    |  150    99    54    10  |   454   282   136    15
F-GEB       |  94   94   89   88    |  223   185   135   103  |   520   363   237   131
HF-GEB      |  37   34   20    8    |  197   150    99    72  |   499   334   192    83
"Best"      |  34   32   17    5    |  201   156    95    52  |   829   730   609   505
Oracle      |  27   22   12  0.8    |  144    93    46     3  |   443   273   128     8

In Table 1, we display our simulation results under exactly the same set- 
ting as in [24] for nine estimators: the James-Stein, the EBThresh [24] using 
the double exponential $dG_0$ in (2.10) and the MLE of $(\omega_0, \tau)$, the SURE [14], 
the FDR [1] at levels q = 0.01 and q = 0.1, the GMLEB (2.15) with the uni- 
form initialization, the S-GMLEB with the initialization (2.20), the F-GEB 
and HF-GEB as the Fourier general EB [36] and a hybrid [38] of its mono- 
tone version with the EBThresh. In each column, boldface entries denote 
the top three performers other than the hybrid estimator. We also display 
as "Best" the best of the simulation results in [24] over the 18 threshold 
estimators and as Oracle the average simulated risk of the oracle Bayes rule 
$t^*_{G_n}$ in (2.8). 

These simulation results can be summarized as follows. The average $\ell_2$ 
loss of the S-GMLEB happens to be the smallest among the nine estimators, 
with the S-GMLEB and GMLEB clearly outperforming all other methods 
by large margins for dense and moderately sparse signals. For very sparse 
signals, the S-GMLEB, the EBThresh, the GMLEB and the FDR estimators 
yield comparable results, and they all outperform the Fourier general EB 
and James-Stein estimators. Compared with the oracle, the regrets of the 
S-GMLEB and GMLEB are nearly fixed constants. Since the oracle prior 
(2.4) has a point mass at 0 in all the models used to generate data in 
this simulation experiment, the S-GMLEB yields slightly better results than 
the GMLEB as expected. The hybrid estimator correctly switches to the 
EBThresh for very sparse signals. 

These simulations and more presented in Section 5 demonstrate the com- 
putational affordability of the proposed GMLEB. The most surprising aspect 
of the results in Table 1 is the strong performance of both versions of 
the GMLEB for the most sparse signals with 0.5% of $\theta_i$ being nonzero, since 
the GMLEB is not specially designed to recover such signals (and threshold 
estimators are). 

2.5. Adaptive ratio optimality. Our theoretical results match well with 
the strong performance of the GMLEB in our simulation experiments. We 
describe here the adaptive ratio optimality of the GMLEB and in the next 
section the adaptive minimaxity of the GMLEB in $\ell_p$ balls. 

The adaptive ratio optimality holds for an estimator $\hat{\boldsymbol{\theta}}: \boldsymbol{X} \to \mathbb{R}^n$ if its risk 
is uniformly within a fraction of the general EB benchmark 

$$\sup_{\boldsymbol{\theta} \in \Theta_n^*} \frac{E_{n,\boldsymbol{\theta}} L_n(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta})}{R^*(G_{n,\boldsymbol{\theta}})} \le 1 + o(1) \tag{2.21}$$

in certain classes $\Theta_n^* \subset \mathbb{R}^n$ of the unknown vector $\boldsymbol{\theta}$, where $L_n(\cdot,\cdot)$ is the 
average squared loss (2.2), $G_{n,\boldsymbol{\theta}} = G_n$ is the empirical distribution of the
unknowns in (2.4) and $R^*(G_n)$ is the general EB benchmark risk (2.8) achieved 
by the oracle Bayes rule $t^*_{G_n}(\boldsymbol{X})$. 

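Theorem 1. Let $\boldsymbol{X} \sim N(\boldsymbol{\theta}, \boldsymbol{I}_n)$ under $P_{n,\boldsymbol{\theta}}$ with a deterministic $\boldsymbol{\theta} \in \mathbb{R}^n$. 
Let $t^*_{\hat{G}_n}(\cdot)$ be the GMLEB in (2.15) with an approximate solution $\hat{G}_n$
satisfying (2.14). Let $G_n = G_{n,\boldsymbol{\theta}}$ and $R^*(G)$ be as in (2.4) and (2.7). Then, 

$$\frac{E_{n,\boldsymbol{\theta}} L_n(t^*_{\hat{G}_n}(\boldsymbol{X}), \boldsymbol{\theta})}{R^*(G_n)} = \frac{E_{n,\boldsymbol{\theta}} \|t^*_{\hat{G}_n}(\boldsymbol{X}) - \boldsymbol{\theta}\|^2}{\min_t E_{n,\boldsymbol{\theta}} \|t(\boldsymbol{X}) - \boldsymbol{\theta}\|^2} \le 1 + o(1) \tag{2.22}$$

for the compound loss (2.2), provided that for certain constants $b_n$ 

$$\frac{nR^*(G_n)}{(\sqrt{\log n} \vee \max_{i \le n} |\theta_i - b_n|)(\log n)^{9/2}} \to \infty.$$

In particular, if $\max_{i \le n} |\theta_i - b_n| = O(\sqrt{\log n})$ and $nR^*(G_n)/(\log n)^5 \to \infty$, 
then (2.22) holds. 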

For any sequences of constants $M_n \to \infty$, Theorem 1 provides the adaptive 
ratio optimality (2.21) of the GMLEB in the classes 

$$\Theta_n^* = \{\boldsymbol{\theta} \in \mathbb{R}^n : R^*(G_{n,\boldsymbol{\theta}}) \ge M_n n^{-1}(\log n)^{9/2}(\sqrt{\log n} \vee \max_{i \le n} |\theta_i|)\}.$$

This is a consequence of an oracle inequality for the GMLEB $\hat{t}_n = t^*_{\hat{G}_n}$ in 
Section 4.2, which uniformly bounds from above 

$$\tilde{r}_{n,\boldsymbol{\theta}}(\hat{t}_n) = \sqrt{E_{n,\boldsymbol{\theta}} L_n(\hat{t}_n(\boldsymbol{X}), \boldsymbol{\theta})} - \sqrt{R^*(G_n)} \tag{2.23}$$

in terms of the weak $\ell_p$ norm of $\boldsymbol{\theta}$. The quantity (2.23) can be viewed 
as the regret for the minimization of the square root of the MSE, instead 
of (2.11). Clearly, $r_{n,\boldsymbol{\theta}}(\hat{t}_n)/R^*(G_n) \le o(1)$ iff $\tilde{r}_{n,\boldsymbol{\theta}}(\hat{t}_n)/\sqrt{R^*(G_n)} \le o(1)$. A 
more general version of Theorem 1 is given in Section 4.3. 

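In the EB literature, the asymptotic optimality of $\hat{\boldsymbol{\theta}}$ is defined as 

$$G_n \to G \ \Rightarrow\ E_{n,\boldsymbol{\theta}} L_n(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta}) - R^*(G_n) \to 0 \tag{2.24}$$

for deterministic vectors $\boldsymbol{\theta} \in \mathbb{R}^n$ [27, 36]. In the EB model 

$$(Y_i, \xi_i) \text{ i.i.d.}, \quad Y_i \mid \xi_i \sim N(\xi_i, 1), \quad \xi_i \sim G, \quad \text{under } P_G \tag{2.25}$$

with data $\{Y_i\}$, the EB asymptotic optimality is defined as 

$$\lim_{n \to \infty} E_G \sum_{i=1}^n (\hat{\xi}_i - \xi_i)^2 / n = R^*(G). \tag{2.26}$$

We call (2.21) adaptive ratio optimality since it is much stronger than both 
notions of asymptotic optimality in its uniformity in $\boldsymbol{\theta} \in \Theta_n^*$ and its focus on 
the harder standard of the relative error, due to $R^*(G_n) \le E_{n,\boldsymbol{\theta}} L_n(\boldsymbol{X}, \boldsymbol{\theta}) = 1$. 
The difference among these optimality properties is significant for moderate 
samples in view of some very small $R^*(G_n) \approx \text{Oracle}/1000$ in Table 1. 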

Theorem 1 is location invariant, since the GMLEB is location equivariant 
by (2.16) and $R^*(G_n)$ is location invariant by (2.8). Thus, if $\theta_i = b_n$ for most 
$i \le n$, the GMLEB performs equally well whether $b_n = 0$ or not. Moreover, if 
$\theta_i \in B$ for all $i$ for a finite set $B \subset \mathbb{R}$, the GMLEB adaptively shrinks toward the 
points in $B$ [19]. This is evident in Table 1 for $\#\{i : \theta_i = 7\} \in \{50, 500\}$ with 
$B = \{0, 7\}$. In fact, if $\#\{x : x \in B_n\} = O(1)$ and $\min_{B_n \ni x \ne y \in B_n} |x - y| \to \infty$, 
then $G_n(B_n) = 1$ implies $R^*(G_n) \to 0$. Threshold methods certainly do not 
possess these location invariance and multiple shrinkage properties. 

2.6. Adaptive minimaxity in $\ell_p$ balls. Minimaxity is commonly used to 
measure the performance of statistical procedures. For $\Theta \subset \mathbb{R}^n$, the minimax 
risk for the average squared loss (2.2) is 

$$\mathcal{R}_n(\Theta) = \inf_{\hat{\boldsymbol{\theta}}} \sup_{\boldsymbol{\theta} \in \Theta} E_{n,\boldsymbol{\theta}} L_n(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta}), \tag{2.27}$$

where the infimum is taken over all Borel mappings $\hat{\boldsymbol{\theta}}: \boldsymbol{X} \to \mathbb{R}^n$. An estimator 
is minimax in a specific class of unknown mean vectors if it attains $\mathcal{R}_n(\Theta)$, 
but this does not guarantee satisfactory performance since the minimax 
estimator is typically uniquely tuned to the specific set $\Theta$. For small $\Theta$, 
the minimax estimator has high risk outside $\Theta$. For large $\Theta$, the minimax 
estimator is too conservative by focusing on the worst case scenario within 
$\Theta$. Adaptive minimaxity overcomes this difficulty by requiring 

$$\frac{\sup_{\boldsymbol{\theta} \in \Theta_n} E_{n,\boldsymbol{\theta}} L_n(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta})}{\mathcal{R}_n(\Theta_n)} \to 1 \tag{2.28}$$

uniformly for a wide range of sequences $\{\Theta_n \subset \mathbb{R}^n, n \ge 1\}$ of parameter 
classes. Define (regular or strong) $\ell_p$ balls as 

$$\Theta_{p,C,n} = \Big\{\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n) : n^{-1}\sum_{i=1}^n |\theta_i|^p \le C^p\Big\}. \tag{2.29}$$

The quantity $C$ in (2.29), called length-normalized or standardized radius 
of the $\ell_p$ ball, is denoted as $\eta$ in [1, 12, 24], where adaptive minimaxity in 
$\ell_p$ balls with $C = C_n \to 0$ and $p < 2$ is used to measure the performance 
of estimators for sparse $\boldsymbol{\theta}$. The following theorem establishes the adaptive 
minimaxity of the GMLEB in $\ell_p$ balls with radii $C = C_n$ in intervals
diverging to $(0, \infty)$. This covers sparse and dense $\boldsymbol{\theta}$ simultaneously. Adaptive 
minimaxity of the GMLEB in weak $\ell_p$ balls is discussed in Section 4.3. 

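Theorem 2. Let $\boldsymbol{X} \sim N(\boldsymbol{\theta}, \boldsymbol{I}_n)$ under $P_{n,\boldsymbol{\theta}}$ with a deterministic $\boldsymbol{\theta} \in \mathbb{R}^n$. 
Let $\hat{\boldsymbol{\theta}} = t^*_{\hat{G}_n}(\boldsymbol{X})$ be the GMLEB in (2.15) with an approximate solution $\hat{G}_n$ 
satisfying (2.14). Let $L_n(\cdot,\cdot)$ be the average squared loss (2.2) and $\mathcal{R}_n(\Theta)$ be 
the minimax risk (2.27). Then, as $n \to \infty$, the adaptive minimaxity (2.28) 
holds in $\ell_p$ balls (2.29) with $\Theta_n = \Theta_{p,C_n,n}$, provided that 

$$\frac{n^{1/(p\wedge 2)} C_n}{(\log n)^{\kappa_1(p)}} \to \infty, \qquad \frac{C_n}{n}(\log n)^{\kappa_2(p)} \to 0, \tag{2.30}$$

where $\kappa_1(p) = 1/2 + 4/p + 3/p^2$ for $p < 2$, $\kappa_1(2) = 13/4$, $\kappa_1(p) = 5/2$ for 
$p > 2$, and $\kappa_2(p) = 9/2 + 4/p$. 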

Theorem 2 is a consequence of the oracle inequality in Section 4.2 and 
the minimax theory in [12]. An outline of this argument is given in the next 
section. An alternative statement of the conclusion of Theorem 2 is 

$$\lim_{(n,M) \to (\infty,\infty)} \sup_{C \in V_{p,n}(M)} \frac{\sup_{\boldsymbol{\theta} \in \Theta_{p,C,n}} E_{n,\boldsymbol{\theta}} L_n(t^*_{\hat{G}_n}(\boldsymbol{X}), \boldsymbol{\theta})}{\mathcal{R}_n(\Theta_{p,C,n})} = 1,$$

where $V_{p,n}(M) = [M n^{-1/(p\wedge 2)}(\log n)^{\kappa_1(p)}, n/\{M(\log n)^{\kappa_2(p)}\}]$. In Section 4.3, 
we offer an analogous result for weak $\ell_p$ balls. The powers $\kappa_1(p)$ and $\kappa_2(p)$ of 
the logarithmic factors in (2.30) and in the definition of $V_{p,n}(M)$ are crude. 
This is further discussed in Section 6. 






Adaptive and approximate minimax estimators of the normal means in $\ell_p$ 
balls have been considered in [1, 3, 12, 14, 24, 36, 38]. Donoho and Johnstone 
[14] proved that as $(n, C_n) \to (\infty, 0+)$, with $nC_n^p/(\log n)^{p/2} \to \infty$ for $p < 2$, 

$$\mathcal{R}_n(\Theta_{p,C_n,n}) = (1 + o(1)) \min_{t \in \mathcal{D}} \max_{\boldsymbol{\theta} \in \Theta_{p,C_n,n}} E_{n,\boldsymbol{\theta}} L_n(t(\boldsymbol{X}), \boldsymbol{\theta}), \tag{2.31}$$

where $\mathcal{D}$ is the collection of all (soft or hard) threshold rules. Therefore, 
adaptive minimaxity (2.28) in small $\ell_p$ balls $\Theta_n = \Theta_{p,C_n,n}$ can be achieved 
by threshold rules with suitable data-driven threshold levels. This has been 
done using the FDR [1] for $(\log n)^5/n \le C_n^p \le n^{-\kappa}$ with $p < 2$ and any $\kappa > 0$. 
Zhang [38] proved that (2.28) holds for the Fourier general EB estimator of 
[36] in $\Theta_n = \Theta_{p,C_n,n}$ for $C_n^{p\wedge 2}\sqrt{n}/(\log n)^{1+(p\wedge 2)/2} \to \infty$. 

A number of estimators have been proven to possess the adaptive rate 
minimaxity in the sense of attaining within a bounded factor of the minimax 
risk. In $\ell_p$ balls $\Theta_{p,C_n,n}$, the EBThresh is adaptive rate minimax for $p < 2$ and 
$nC_n^p \ge (\log n)^2$ [24], while the generalized $C_p$ is adaptive rate minimax for 
$p < 2$ and $1 \le O(1)nC_n^p$ [3]. It follows from [3, 38] that a hybrid between the 
Fourier general EB and universal soft threshold estimators is also adaptive 
rate minimax in $\Theta_{p,C_n,n}$ for $1 \le O(1)nC_n^p$. 

The adaptive minimaxity as provided in Theorem 2 unifies the adaptive 
minimaxity of different types of estimators in different ranges of the radii $C_n$ 
of the $\ell_p$ balls with the exception of the two very extreme ends, due to the 
crude power $\kappa_1(p)$ of the logarithmic factor for small $C_n$ and the requirement 
of an upper bound for large $C_n$. The hybrid Fourier general EB estimator 
achieves the adaptive rate minimaxity in a wider range of $\ell_p$ balls than what 
we prove here for the GMLEB. However, as we have seen in Table 1, the 
finite sample performance of the GMLEB is much stronger. It seems that 
the less stringent and commonly considered adaptive rate minimaxity leaves 
too much room to provide adequate indication of finite sample performance. 

2.7. Minimax theory in $\ell_p$ balls. Instead of the general EB approach, 
adaptive minimax estimation in small $\ell_p$ balls can be achieved by threshold 
methods, provided that the radius is not too small. However, since (2.31) 
does not hold for fixed $p > 0$ and $C \in (0, \infty)$, threshold estimators are not 
asymptotically minimax with $\Theta_n = \Theta_{p,C,n}$ in (2.28) for fixed $(p, C)$.
Consequently, adaptive minimax estimations in small, fixed and large $\ell_p$ balls are 
often treated separately in the literature. In this section, we explain the
general EB approach for adaptive minimax estimation, which provides a unified 
treatment for $\ell_p$ balls of different ranges of radii. This provides an outline of 
the proof of Theorem 2. Minimax theory in weak $\ell_p$ balls will be discussed 
in Section 4.3. 

We first discuss the relationship between the minimax estimation of a
deterministic vector $\theta$ in $\ell_p$ balls and the minimax estimation of a single
random mean under an unknown "prior" in $L_p$ balls. For positive $p$ and $C$,
the $L_p$ balls of distribution functions are defined as

$\mathscr{G}_{p,C} = \Big\{G : \int |u|^p G(du) \le C^p\Big\}.$

Since $\mathscr{G}_{p,C}$ is a convex class of distributions, the minimax theorem provides

(2.32)  $\mathscr{R}(\mathscr{G}_{p,C}) = \min_t \max_{G\in\mathscr{G}_{p,C}} E_G(t(Y)-\xi)^2 = \max_{G\in\mathscr{G}_{p,C}} R^*(G) \le 1$

for the estimation of a single real random parameter $\xi$ in the model (2.3),
where $R^*(G)$ is the minimum Bayes risk in (2.7). Thus, since $G_n = G_{n,\theta} \in \mathscr{G}_{p,C}$
for $\theta \in \Theta_{p,C,n}$, the fundamental theorem of compound decisions (2.5)
implies that (2.32) dominates the compound minimax risk (2.27) in $\ell_p$ balls:

(2.33)  $\mathscr{R}_n(\Theta_{p,C,n}) \le \inf_{t(x)} \sup_{\theta\in\Theta_{p,C,n}} E_{n,\theta}L_n(t(X),\theta) \le \mathscr{R}(\mathscr{G}_{p,C}) \le 1.$

Donoho and Johnstone [12] proved that as $C^{p\wedge 2} \to 0+$

(2.34)  $\dfrac{\mathscr{R}(\mathscr{G}_{p,C})}{C^{p\wedge 2}\{2\log(1/C^p)\}^{(1-p/2)_+}} \to 1$

and that for either $p \ge 2$ with $C_n > 0$ or $p < 2$ with $nC_n^p/(\log n)^{p/2} \to \infty$,

(2.35)  $\dfrac{\mathscr{R}_n(\Theta_{p,C_n,n})}{\mathscr{R}(\mathscr{G}_{p,C_n})} \to 1.$

In the general EB approach, the aim is to find an estimator $\hat t_n$ of $t^*_{G_n}$ with
small regret (2.11) or (2.23). If the approximation to $t^*_{G_n}$ in risk is sufficiently
accurate and uniformly within a small fraction of $\mathscr{R}(\mathscr{G}_{p,C_n})$ for $\theta \in \Theta_{p,C_n,n}$,
the maximum risk of the general EB estimator in $\Theta_{p,C_n,n}$ would be within
the same small fraction of $\mathscr{R}(\mathscr{G}_{p,C_n})$, since the risk of $t^*_{G_{n,\theta}}$ is bounded by
$R^*(G_{n,\theta}) \le \mathscr{R}(\mathscr{G}_{p,C_n})$ for $\theta \in \Theta_{p,C_n,n}$. Thus, (2.35) plays a crucial role in
general EB.

It follows from (2.23), (2.32) and (2.29) that

(2.36)  $\sup_{\theta\in\Theta_{p,C,n}} \sqrt{E_{n,\theta}L_n(\hat t_n(X),\theta)} \le \sup_{\theta\in\Theta_{p,C,n}} \tilde r_{n,\theta}(\hat t_n) + \sqrt{\mathscr{R}(\mathscr{G}_{p,C})}.$

Thus, by (2.34) and (2.35), the adaptive minimaxity (2.28) of $\hat\theta = \hat t_n(X)$ in
$\ell_p$ balls $\Theta_n = \Theta_{p,C_n,n}$ is a consequence of an oracle inequality of the form

(2.37)  $\sup_{\theta\in\Theta_{p,C_n,n}} \tilde r_{n,\theta}(\hat t_n) = o(1)\sqrt{J_{p,C_n}}$

with $J_{p,C} = \min\{1, C^{p\wedge 2}\{1 \vee (2\log(1/C^p))\}^{(1-p/2)_+}\}$. In our proof, (2.34) and
the upper bound $\mathscr{R}(\mathscr{G}_{p,C}) \le 1$ provide $\inf_C \mathscr{R}(\mathscr{G}_{p,C})/J_{p,C} > 0$. Although $J_{p,C}$
provides the order of $\mathscr{R}(\mathscr{G}_{p,C})$ for each $p$ via (2.34), explicit expressions of
the minimax risk $\mathscr{R}_n(\Theta_{p,C,n})$ for general fixed $(p,C,n)$ or the minimax risk
$\mathscr{R}(\mathscr{G}_{p,C})$ for fixed $(p,C)$ with $p \ne 2$ are still open problems.

3. A regularized Bayes estimator with a misspecified prior. In this section,
we consider a fixed probability $P_{G_0}$ under which

(3.1)  $Y\,|\,\xi \sim N(\xi,1), \qquad \xi \sim G_0.$

Recall [5, 28] that for the estimation of a normal mean, the Bayes rule (2.6)
and its risk (2.7) can be expressed in terms of the mixture density $f_G(x)$ as

(3.2)  $t^*_G(x) = x + \dfrac{f_G'(x)}{f_G(x)}, \qquad R^*(G) = 1 - \int \Big(\dfrac{f_G'}{f_G}\Big)^2 f_G$

in the model (2.3), where $f_G(x) = \int \varphi(x-u)G(du)$ is as in (2.13).

Suppose the true prior $G_0$ is unknown but a deterministic approximation
of it, say $G$, is available. The Bayes formula (3.2) could still be used, but
we may want to avoid dividing by a near-zero quantity. This leads to the
following regularized Bayes estimator:

(3.3)  $t^*_G(x;\rho) = x + \dfrac{f_G'(x)}{f_G(x)\vee\rho}.$

For $\rho = 0$, $t^*_G(x;0) = t^*_G(x)$ is the Bayes estimator for the prior $G$. For $\rho = \infty$,
$t^*_G(x;\infty) = x$ gives the MLE of $\xi$ which requires no knowledge of the prior.
The following proposition, proved in the Appendix, describes some analytical
properties of the regularized Bayes estimator.
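As an illustration of (3.3), the following is a minimal Python sketch for a prior with finitely many atoms. The function names `phi`, `mixture_density` and `regularized_bayes`, and the restriction to a discrete $G$, are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def mixture_density(x, atoms, weights):
    """Return f_G(x) and f_G'(x) for the discrete prior G = sum_j weights[j] * delta_{atoms[j]}."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    diff = x[:, None] - atoms[None, :]      # x - u_j
    kern = phi(diff)                        # phi(x - u_j)
    f = kern @ weights                      # f_G(x)
    fprime = (-diff * kern) @ weights       # d/dx phi(x - u) = -(x - u) phi(x - u)
    return f, fprime

def regularized_bayes(x, atoms, weights, rho=0.0):
    """t*_G(x; rho) = x + f_G'(x) / max(f_G(x), rho), following (3.3).

    rho = 0 gives the Bayes rule (3.2); rho = inf returns x itself.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    f, fprime = mixture_density(x, atoms, weights)
    return x + fprime / np.maximum(f, rho)
```

With a hypothetical sparse prior such as `atoms=[0, 3]`, `weights=[0.9, 0.1]`, the regularized rule coincides with the Bayes rule wherever $f_G(x) \ge \rho$ and is pulled toward $x$ in the tails where $f_G(x) < \rho$.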



Proposition 1. Let $L(y) = \sqrt{-\log(2\pi y^2)}$, $y \ge 0$, be the inverse function
of $y = \varphi(x)$. Then, the value of the regularized Bayes estimator $t^*_G(x;\rho)$
in (3.3) is always between those of the Bayes estimator $t^*_G(x)$ in (2.6) and
the MLE $t^*_G(x;\infty) = x$. Moreover, for all real $x$

(3.4)  $|x - t^*_G(x;\rho)| = \dfrac{|f_G'(x)|}{f_G(x)\vee\rho} \le L(\rho)$ for $0 < \rho \le (2\pi e)^{-1/2}$,
       $0 \le (\partial/\partial x)\,t^*_G(x;\rho) \le L^2(\rho)$ for $0 < \rho \le (2\pi e^3)^{-1/2}$.

Remark 1. In [36], a slightly different inequality 

^ (¥n) 2 l¥^r^ i2 ^ 0<p<(2 7 re 2 r 1/2 I 

\ fG{X) J fG{X)\>p 

was used to derive oracle inequalities for Fourier general EB estimators. The 
extension to the derivative of t G (x;p) here is needed for the application of 
the Gaussian isoperimetric inequality in Proposition 4. 

The next theorem provides oracle inequalities which bound the regret of
using (3.3) due to the lack of knowledge of the true $G_0$. Let

(3.6)  $d(f,g) = \Big(\int (f^{1/2} - g^{1/2})^2\Big)^{1/2}$

denote the Hellinger distance. The upper bounds assert that the regret is
no greater than the square of the Hellinger distance between the mixture
densities $f_G$ and $f_{G_0}$ up to certain logarithmic factors.
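The Hellinger distance (3.6) between two normal mixture densities is easy to approximate numerically. The sketch below assumes discrete mixing distributions and uses a plain Riemann sum on a wide grid; the names `normal_mixture_density` and `hellinger`, and the padding and grid-size defaults, are choices made for this illustration only.

```python
import numpy as np

def normal_mixture_density(x, atoms, weights):
    """f_G(x) = sum_j weights[j] * phi(x - atoms[j]) for a discrete mixing distribution G."""
    x = np.asarray(x, dtype=float)
    z = x[:, None] - np.asarray(atoms, dtype=float)[None, :]
    return (np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)) @ np.asarray(weights, dtype=float)

def hellinger(atoms_a, w_a, atoms_b, w_b, pad=10.0, num=20001):
    """d(f_A, f_B) = ( int (sqrt(f_A) - sqrt(f_B))^2 )^{1/2}, as in (3.6), without a 1/2 factor."""
    lo = min(np.min(atoms_a), np.min(atoms_b)) - pad
    hi = max(np.max(atoms_a), np.max(atoms_b)) + pad
    x = np.linspace(lo, hi, num)
    fa = normal_mixture_density(x, atoms_a, w_a)
    fb = normal_mixture_density(x, atoms_b, w_b)
    dx = x[1] - x[0]
    # Riemann sum; the integrand is negligible at both ends of the padded grid.
    return float(np.sqrt(np.sum((np.sqrt(fa) - np.sqrt(fb)) ** 2) * dx))
```

For two point masses at distance $\Delta$ the value can be compared with the closed form $d^2 = 2\{1 - \exp(-\Delta^2/8)\}$, which follows from the Gaussian integral $\int \sqrt{\varphi(x)\varphi(x-\Delta)}\,dx = e^{-\Delta^2/8}$.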

Theorem 3. Suppose (3.1) holds under $P_{G_0}$. Let $t^*_G(x;\rho)$ be the regularized
Bayes rule in (3.3) with $0 < \rho \le (2\pi e^2)^{-1/2}$. Let $f_G$ be as in (2.13).

(i) There exists a universal constant $M_0$ such that

(3.7)  $[E_{G_0}\{t^*_G(Y;\rho) - \xi\}^2 - R^*(G_0)]^{1/2}$
       $\le M_0 \max\{|\log\rho|^{3/2}, |\log(d(f_G,f_{G_0}))|^{1/2}\}\, d(f_G,f_{G_0}) + \Big\{\int \Big(1 - \dfrac{f_{G_0}}{\rho}\Big)_+^2 \dfrac{(f_{G_0}')^2}{f_{G_0}}\Big\}^{1/2},$

where $R^*(G_0) = E_{G_0}\{t^*_{G_0}(Y) - \xi\}^2$ is the minimum Bayes risk in (2.7).

(ii) If $\int_{|u|>x_0} G_0(du) \le M_1|\log\rho|^3\varepsilon_0^2$ and $2(x_0+1)\rho \le M_2|\log\rho|^2\varepsilon_0^2$ for a
certain $\varepsilon_0 \ge d(f_G,f_{G_0})$ and finite positive constants $\{x_0, M_1, M_2\}$, then

(3.8)  $E_{G_0}\{t^*_G(Y;\rho) - \xi\}^2 - R^*(G_0) \le 2(M_0 + M_1 + M_2)\max(|\log\rho|^3, |\log\varepsilon_0|)\,\varepsilon_0^2,$

where $M_0$ is a universal constant.

Remark 2. For $G = G_0$, (3.7) becomes an identity, so that the square
of the first term on the right-hand side of (3.7) represents an upper bound
for the regret of using a misspecified $G$ in the regularized Bayes estimator
(3.3) instead of the true $G_0$ for the same regularization level $\rho$. Under the
additional tail probability condition on $G_0$ and for sufficiently small $\rho$, (3.8)
provides an upper bound for the regret of not knowing $G_0$, compared with
the Bayes estimator (3.2) with the true $G = G_0$.

Remark 3. Since the second term on the right-hand side of (3.7) is
increasing in $\rho$ and the first is logarithmic in $1/\rho$, we are allowed to take
$\rho > 0$ of much smaller order than $d(f_G,f_{G_0})$ in (3.7), for example, under
moment conditions on $G_0$. Still, the cubic power of the logarithmic factors
in (3.7) and (3.8) is crude.

The following lemma plays a crucial role in the proof of Theorem 3. 

Lemma 1. Let $d(f,g)$ be as in (3.6) and $L(y) = \sqrt{-\log(2\pi y^2)}$. Then,

(3.9)  $\int \dfrac{(f_G' - f_{G_0}')^2}{f_G\vee\rho + f_{G_0}\vee\rho} \le e^2\, 2d^2(f_G,f_{G_0}) \max(L^6(\rho), 2a^2)$

for $\rho \le 1/\sqrt{2\pi}$, where $a^2 = \max\{L^2(\rho) + 1, |\log d^2(f_G,f_{G_0})|\}$.






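Proof of Theorem 3. Let

$\|g\|_h = \Big\{\int g^2(x)h(x)\,dx\Big\}^{1/2}$

be the $L_2(h(x)\,dx)$ norm for $h \ge 0$. Since $t^*_{G_0}$ is the Bayes rule, by (3.3)

(3.10)  $[E_{G_0}\{t^*_G(Y;\rho) - \xi\}^2 - E_{G_0}\{t^*_{G_0}(Y) - \xi\}^2]^{1/2}$
        $= \|f_G'/(f_G\vee\rho) - f_{G_0}'/f_{G_0}\|_{f_{G_0}}$
        $\le r(f_G,\rho) + \|(1 - f_{G_0}/\rho)_+ f_{G_0}'/f_{G_0}\|_{f_{G_0}},$

where $r(f_G,\rho) = \|f_G'/(f_G\vee\rho) - f_{G_0}'/(f_{G_0}\vee\rho)\|_{f_{G_0}}$.

Let $w_* = 1/(f_G\vee\rho + f_{G_0}\vee\rho)$. For $G_1 = G$ or $G_1 = G_0$,

$\int \Big(\dfrac{f_{G_1}'}{f_{G_1}\vee\rho} - 2f_{G_1}'w_*\Big)^2 f_{G_0} \le \int \Big(\dfrac{f_{G_1}'}{f_{G_1}\vee\rho}\Big)^2 (f_G - f_{G_0})^2 w_*^2 f_{G_0} \le L^2(\rho)\int (f_G - f_{G_0})^2 w_*^2 f_{G_0}$

due to $|f_{G_1}'|/(f_{G_1}\vee\rho) \le L(\rho)$ by (3.4). Since $(\sqrt{f_G} + \sqrt{f_{G_0}})^2 w_* \le 2$ and
$w_* f_{G_0} \le 1$, we find

$r(f_G,\rho) \le 2\|(f_G' - f_{G_0}')w_*\|_{f_{G_0}} + 2L(\rho)\|(f_G - f_{G_0})w_*\|_{f_{G_0}}$
$\le 2\|f_G' - f_{G_0}'\|_{w_*} + 2L(\rho)\sqrt{2}\,d(f_G,f_{G_0}).$

Thus, (3.7) follows from (3.10) and (3.9).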
To prove (3.8) we use Lemma 6.1 in [38]: 

2 

/Go 



fa <p V /g 



< / G (du) + 2x /9max{L 2 ( / 9), 2} + Ip^j L\p) + 2 

■/ 1 u\ >xo 

<(M l + M 2 )\logp\ 3 el 
due to | logp| > L 2 (p) > 2. This and (3.7) imply (3.8). □ 

4. An oracle inequality for the GMLEB. In this section, we provide an
oracle inequality which bounds the regret (2.23) and thus (2.11) of using the
GMLEB $t^*_{\hat G_n}$ in (2.15) against the oracle Bayes rule $t^*_{G_n}$ in (2.8). We provide
the main elements leading to the oracle inequality in Section 4.1 before
presenting the oracle inequality and an outline of its proof in Section 4.2.
Section 4.3 discusses the consequences of the oracle inequality, including a
sharper version of Theorem 1 and the adaptive minimaxity in weak $\ell_p$ balls.

4.1. Elements leading to the oracle inequality. It follows from the fundamental
theorem of compound decisions (2.5) that for separable estimators
$\hat\theta = t(X)$, the compound risk is identical to the MSE of $\hat\xi = t(Y)$ for the
estimation of a single real random parameter $\xi$ under $P_{G_n}$ in (2.3), so that
Theorem 3 provides an upper bound for the regret of the regularized Bayes
rule $t^*_G(X;\rho)$ in terms of the Hellinger distance $d(f_G, f_{G_n})$ and $\rho > 0$. We
have proved in [39] a large deviation upper bound for the Hellinger distance
$d(f_{\hat G_n}, f_{G_n})$. We will show that the GMLEB estimator $t^*_{\hat G_n}(X)$ is identical to
its regularized version $t^*_{\hat G_n}(X;\rho_n)$ for certain $|\log\rho_n| \asymp \log n$ when the generalized
MLE (2.12) or its approximation (2.14) is used. Still, $t^*_{\hat G_n}(X;\rho_n)$
is not separable, since the generalized MLE $\hat G_n$ is based on the same data
$X$. A natural approach of deriving oracle inequalities is then to combine
Theorem 3 with a maximal inequality. This requires in addition an entropy
bound for the class of regularized Bayes rules $t^*_G(x;\rho)$ with given $\rho > 0$ and
an exponential inequality for the difference between the loss and risk for each
regularized Bayes rule. In the rest of this section, we provide these crucial
components of our theoretical investigation.

4.1.1. A large deviation inequality for the convergence of an approximate
generalized MLE. Under the i.i.d. assumption of the EB model (2.25),
Ghosal and van der Vaart [20] obtained an exponential inequality for the
Hellinger loss of the generalized MLE of a normal mixture density in terms
of the $L_\infty$ norm of $\theta$. This result can be improved upon using their newer
entropy calculation in [21]. The results in [20, 21] are unified and further improved
upon in the i.i.d. case and extended to deterministic $\theta = (\theta_1,\ldots,\theta_n)$
in weak $\ell_p$ balls for all $0 < p \le \infty$ in [39]. This latest result, stated below as
Theorem 4, will be used here in conjunction with Theorem 3 to prove oracle
inequalities for the GMLEB.

The $p$th weak moment of a distribution $G$ is

(4.1)  $\mu_p^w(G) = \Big\{\sup_{x>0}\, x^p \int_{|u|>x} G(du)\Big\}^{1/p}$

with $\mu_\infty^w(G) = \inf\{x : \int_{|u|>x} G(du) = 0\}$. Define convergence rates

(4.2)  $\varepsilon(n,G,p) = \max\Big[\sqrt{2\log n},\ \{n^{1/p}\sqrt{\log n}\,\mu_p^w(G)\}^{p/(2+2p)}\Big]\sqrt{\dfrac{\log n}{n}}$

with $\varepsilon(n,G,\infty) = \{(2\log n) \vee (\sqrt{\log n}\,\mu_\infty^w(G))\}^{1/2}\sqrt{(\log n)/n}$.

Theorem 4. Let $X \sim N(\theta, I_n)$ under $P_{n,\theta}$ with a deterministic $\theta \in \mathbb{R}^n$.
Let $f_G$ and $G_n$ be as in (2.13) and (2.4), respectively. Let $\hat G_n$ be a certain approximate
generalized MLE satisfying (2.14). Then, there exists a universal
constant $x_*$ such that for all $x \ge x_*$ and $\log n \ge 2/p$,

(4.3)  $P_{n,\theta}\{d(f_{\hat G_n}, f_{G_n}) \ge x\varepsilon_n\} \le \exp\Big(-\dfrac{x^2 n\varepsilon_n^2}{2\log n}\Big) \le e^{-x^2\log n},$

where $\varepsilon_n = \varepsilon(n,G_n,p)$ is as in (4.2) and $d(f,g)$ is the Hellinger distance
(3.6). In particular, for any sequences of constants $M_n \to \infty$ and fixed positive
$\alpha$ and $c$,

$\varepsilon_n \asymp n^{-p/(2+2p)}(\log n)^{(2+3p)/(4+4p)}$, if $\mu_p^w(G_n) = O(1)$ with a fixed $p$,
$\varepsilon_n \asymp n^{-1/2}(\log n)^{3/4}\{M_n^{1/2} \vee (\log n)^{1/4}\}$, if $G_n([-M_n,M_n]) = 1$ and $p = \infty$,
$\varepsilon_n \asymp n^{-1/2}(\log n)^{1/(2(2\wedge\alpha))+3/4}$, if $\int e^{|cu|^\alpha} G_n(du) = O(1)$ and $p \asymp \log n$.
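For a deterministic vector $\theta$, the weak moment (4.1) of the empirical distribution $G_n$ and the rate (4.2) can be evaluated directly. The following sketch is an illustration only; the function names `weak_moment` and `rate_eps` are introduced here, and the code uses the observation that for an empirical distribution the supremum in (4.1) is attained just below one of the ordered values $|\theta|_{(1)} \ge \cdots \ge |\theta|_{(n)}$, so that $\mu_p^w(G_n) = \max_k (k/n)^{1/p}|\theta|_{(k)}$.

```python
import numpy as np

def weak_moment(theta, p):
    """p-th weak moment (4.1) of the empirical distribution of the entries of theta."""
    a = np.sort(np.abs(np.asarray(theta, dtype=float)))[::-1]   # |theta|_(1) >= ... >= |theta|_(n)
    if np.isinf(p):
        return float(a[0])                                      # essential supremum
    n = a.size
    k = np.arange(1, n + 1)
    return float(np.max((k / n) ** (1.0 / p) * a))

def rate_eps(theta, p):
    """Convergence rate eps(n, G_n, p) of (4.2) for the empirical distribution G_n of theta."""
    n = np.size(theta)
    logn = np.log(n)
    mu = weak_moment(theta, p)
    if np.isinf(p):
        lead = np.sqrt(max(2.0 * logn, np.sqrt(logn) * mu))
    else:
        lead = max(np.sqrt(2.0 * logn),
                   (n ** (1.0 / p) * np.sqrt(logn) * mu) ** (p / (2.0 + 2.0 * p)))
    return float(lead * np.sqrt(logn / n))
```

For a very sparse $\theta$ and small $p$, the first term of the maximum in (4.2) dominates and the rate reduces to $\sqrt{2}\,(\log n)/\sqrt{n}$.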



Remark 4. Under the condition $G([-M_n,M_n]) = 1$ and the i.i.d. assumption
(2.25) with $G$ depending on $n$, the large deviation bound in [20]
provides the convergence rate $\varepsilon_n \asymp n^{-1/2}(\log n)^{1/2}\{M_n \vee (\log n)^{1/2}\}$, and the
entropy calculation in [21] leads to the convergence rate $\varepsilon_n \asymp n^{-1/2}(\log n)\sqrt{M_n}$.
These rates are slower than the rate in Theorem 4 when $M_n/\sqrt{\log n} \to \infty$.

Remark 5. The proof of Theorem 4 is identical for the generalized
MLE (2.12) and its approximation (2.14). The constant $x_*$ is universal for
$q_n = (e\sqrt{2\pi}/n^2) \wedge 1$ in (2.14) and depends on $\sup_n |\log q_n|/\log n$ in general.

4.1.2. Representation of the GMLEB estimator as a regularized one at
data points. The connection between the GMLEB estimator (2.15) and
the regularized Bayes rule (3.3) in Theorem 3 is provided by

(4.4)  $t^*_{\hat G_n}(X) = t^*_{\hat G_n}(X;\rho_n), \qquad \rho_n = q_n/(en\sqrt{2\pi}),$

where $q_n$ is as in (2.14). This is a consequence of the following proposition.

Proposition 2. Let $f(x|u)$ be a given family of densities and $\{X_i, i \le n\}$
be given data. Let $\hat G_n$ be an approximate generalized MLE of a mixing
distribution satisfying

$\prod_{i=1}^n \int f(X_i|u)\,\hat G_n(du) \ge q_n \sup_G \prod_{i=1}^n \int f(X_i|u)\,G(du)$

for certain $0 < q_n \le 1$. Then, for all $j = 1,\ldots,n$

$f_{\hat G_n}(X_j) = \int f(X_j|u)\,\hat G_n(du) \ge \dfrac{q_n}{en}\sup_u f(X_j|u).$

In particular, (4.4) holds for $f(x|u) = \varphi(x-u)$.

Proof. Let $j$ be fixed and $u_j = \arg\max_u f(X_j|u)$. Define $\hat G_{n,j} = (1-\varepsilon)\hat G_n + \varepsilon\delta_{u_j}$
with $\varepsilon = 1/n$, where $\delta_u$ is the unit mass at $u$. Since $f(x|u) \ge 0$,
$f_{\hat G_{n,j}}(X_i) \ge (1-\varepsilon)f_{\hat G_n}(X_i)$ and $f_{\hat G_{n,j}}(X_j) \ge \varepsilon f(X_j|u_j)$, so that

$\prod_{i=1}^n f_{\hat G_n}(X_i) \ge q_n \prod_{i=1}^n f_{\hat G_{n,j}}(X_i) \ge q_n (1-\varepsilon)^{n-1}\varepsilon f(X_j|u_j) \prod_{i\ne j} f_{\hat G_n}(X_i).$

Thus, $f_{\hat G_n}(X_j) \ge q_n(1-\varepsilon)^{n-1}\varepsilon f(X_j|u_j)$ with $\varepsilon = 1/n$, after the cancellation
of $f_{\hat G_n}(X_i)$ for $i \ne j$. The conclusion follows from $(1 - 1/n)^{n-1} \ge 1/e$. □

4.1.3. An entropy bound for regularized Bayes rules. We now provide an
entropy bound for collections of regularized Bayes rules. For any family $\mathscr{H}$
of functions and semidistance $d_0$, the $\varepsilon$-covering number is

(4.5)  $N(\varepsilon,\mathscr{H},d_0) = \inf\Big\{N : \mathscr{H} \subseteq \bigcup_{j=1}^N \mathrm{Ball}(h_j,\varepsilon,d_0)\Big\}$

with $\mathrm{Ball}(h,\varepsilon,d_0) = \{f : d_0(f,h) < \varepsilon\}$. For each fixed $\rho > 0$ define the complete
collection of the regularized Bayes rules $t^*_G(x;\rho)$ in (3.3) as

(4.6)  $\mathscr{T}_\rho = \{t^*_G(\cdot;\rho) : G \in \mathscr{G}\},$

where $\mathscr{G}$ is the family of all distribution functions. The following proposition,
proved in the Appendix, provides an entropy bound for (4.6) under the
seminorm $\|h\|_{\infty,M} = \sup_{|x|\le M}|h(x)|$.

Proposition 3. Let $L(y) = \sqrt{-\log(2\pi y^2)}$ be the inverse of $y = \varphi(x)$ as
in Proposition 1. Then, for all $0 < \eta \le \rho \le (2\pi e)^{-1/2}$,

(4.7)  $\log N(\eta^*, \mathscr{T}_\rho, \|\cdot\|_{\infty,M}) \le \{4(6L^2(\eta)+1)(2M/L(\eta)+3) + 2\}|\log\eta|,$

where $\eta^* = (\eta/\rho)\{3L(\eta) + 2\}$.

4.1.4. An exponential inequality for the loss of regularized Bayes rules.
The last element of our proof is an exponential inequality for the difference
between the loss and risk of regularized Bayes rules $t^*_G(X;\rho)$. For each
separable rule $t(x)$, the squared loss $\|t(X) - \theta\|^2$ is a sum of independent
variables. However, a direct application of the empirical process theory to
the loss would yield an oracle inequality of the $n^{-1/2}$ order, which is inadequate
for the sharper convergence rates in this paper. Thus, we use the
following isoperimetric inequality for the square root of the loss.

Proposition 4. Suppose $X \sim N(\theta, I_n)$ under $P_{n,\theta}$. Let $t^*_G(x;\rho)$ be the
regularized Bayes rule as in (3.3), with a deterministic distribution $G$ and
$0 < \rho \le (2\pi e^3)^{-1/2}$. Let $L(\rho) = \sqrt{-\log(2\pi\rho^2)}$. Then, for all $x > 0$

$P_{n,\theta}\{\|t^*_G(X;\rho) - \theta\| \ge E_{n,\theta}\|t^*_G(X;\rho) - \theta\| + x\} \le \exp\Big(-\dfrac{x^2}{2L^4(\rho)}\Big).$

Proof. Let $h(x) = \|t^*_G(x;\rho) - \theta\|$. It follows from Proposition 1 that

$|h(x) - h(y)| \le \|t^*_G(x;\rho) - t^*_G(y;\rho)\| \le \|x-y\| \sup_x |(\partial/\partial x)\,t^*_G(x;\rho)| \le L^2(\rho)\|x-y\|.$

Thus, $h(x)/L^2(\rho)$ has the unit Lipschitz norm. The conclusion follows from
the Gaussian isoperimetric inequality [4]. See page 439 of [34]. □



4.2. An oracle inequality. Our oracle inequality for the GMLEB, stated
in Theorem 5 below, is a key result of this paper from a mathematical point
of view. It builds upon Theorems 3 and 4 and Propositions 2, 3 and 4 (the
regularized Bayes rules with misspecified prior, generalized MLE of normal
mixtures, representation of the GMLEB, entropy bounds and Gaussian concentration
inequality) and leads to adaptive ratio optimality and minimax
theorems more general than Theorems 1 and 2.

Theorem 5. Let $X \sim N(\theta, I_n)$ under $P_{n,\theta}$ with a deterministic $\theta \in \mathbb{R}^n$
as in (2.1). Let $L_n(\cdot,\cdot)$ be the average squared loss in (2.2) and $0 < p \le \infty$.
Let $t^*_{\hat G_n}(X)$ be the GMLEB estimator (2.15) with an approximate generalized
MLE $\hat G_n$ satisfying (2.14). Then, there exists a universal constant $M_0$ such
that for all $\log n \ge 2/p$,

(4.8)  $\tilde r_{n,\theta}(t^*_{\hat G_n}) = \sqrt{E_{n,\theta}L_n(t^*_{\hat G_n}(X),\theta)} - \sqrt{R^*(G_n)} \le M_0\,\varepsilon_n(\log n)^{3/2},$

where $R^*(G_n)$ is the minimum risk of all separable estimators as in (2.8)
with $G_n = G_{n,\theta}$ as in (2.4), and $\varepsilon_n = \varepsilon(n,G_n,p)$ is as in (4.2). In particular,
for any sequences of constants $M_n \to \infty$ and fixed positive $\alpha$ and $c$,

$\varepsilon_n \asymp n^{-p/(2+2p)}(\log n)^{(2+3p)/(4+4p)}$, if $\mu_p^w(G_n) = O(1)$ with a fixed $p$,
$\varepsilon_n \asymp n^{-1/2}(\log n)^{3/4}\{M_n^{1/2} \vee (\log n)^{1/4}\}$, if $G_n([-M_n,M_n]) = 1$ and $p = \infty$,
$\varepsilon_n \asymp n^{-1/2}(\log n)^{1/(2(2\wedge\alpha))+3/4}$, if $\int e^{|cu|^\alpha} G_n(du) = O(1)$ and $p \asymp \log n$.

Remark 6. In the proof of Theorem 5, applications of Theorems 3 and 
4 resulted in the leading term for the upper bound in (4.18), while the 
contributions of other parts of the proof are of smaller order. 



Remark 7. The $M_0$ in (4.8) is universal for $q_n = (e\sqrt{2\pi}/n^2) \wedge 1$ in (2.14)
and depends on $\sup_n |\log q_n|/\log n$ in general.

The consequences of Theorem 5 upon the adaptive ratio optimality and 
minimaxity of the GMLEB are discussed in the next section. Here is an 
outline of its proof. The large deviation inequality in Theorem 4 and the 
representation of the GMLEB in (4.4) imply that 

(4.9) (X) - 0|| < (X; p n ) - 6\\I An + Cm, Pn 



where A n = {d{f~ ,f Gn ) < x*e n } and Cin = ||*~ (X;/9 n ) - 0\\I A c with x* = 

V 1. By (3.2) and Proposition 1, \t* G {Xi;p n ) - (9;| < L(p n ) + |iV(0,l)|, so 
that Theorem 4 provides an upper bound for E n: eCi n - By the entropy bound 
in Proposition 3, there exists a finite collection of distributions {Hj,j < N} 
of manageable size iV such that 

(4.10) C2„ = (||tp (X;p n )-e\\I An -max\\t* H (X-,p n )-0\\\ 

is small and d(fH J , fc n ) < x*e n for all j < N. Since the regularized Bayes 
rules t* H .(X; p n ) are separable and the collection {Hj,j < iV} is of manage- 
able size, the large deviation inequality in Proposition 4 implies that 

(4.11) Csn = max{||^ (X; p n ) - 6\\ - E n>0 \\t* H (X; p n ) - 9\\} + 

3<N J J 

is small. Since c?(/j?-,/g„) < x*e n , Theorem 3 implies that 



(4.12) ( 4n = maxJE n , \\t* H .(X-,p n ) - 0|| 2 - JnR*{G ri 
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is no greater than 0{x*e n )(\og /5 n ) 3 / 2 , where R*(G n ) is the general EB bench- 
er 2 



mark risk in (2.8). Finally, upper bounds for individual pieces E n gQ n are 



put together via 



(4.13) jE^jiTJX) - 9p < \/.ifl>(G„) + 



\ 



( 4 



4.3. Adaptive ratio optimality and minimaxity. We discuss here the optimality
properties of the GMLEB as consequences of the oracle inequality
in Theorem 5.

Theorem 5 immediately implies the adaptive ratio optimality (2.21) of
the GMLEB in the classes $\Theta_n^* = \Theta_n^*(M_n)$ for any sequences of constants
$M_n \to \infty$, where

(4.14)  $\Theta_n^*(M) = \Big\{\theta \in \mathbb{R}^n : R^*(G_{n,\theta}) \ge M(\log n)^3 \inf_{p \ge 2/\log n} \varepsilon^2(n,G_{n,\theta},p)\Big\}$

with $G_{n,\theta} = G_n$ as in (2.4) and $\varepsilon(n,G,p)$ as in (4.2). This is formally stated
in the theorem below.

Theorem 6. Let $X \sim N(\theta, I_n)$ under $P_{n,\theta}$ with a deterministic $\theta \in \mathbb{R}^n$.
Let $t^*_{\hat G_n}(X)$ be the GMLEB estimator (2.15) with the approximate MLE $\hat G_n$
in (2.14). Let $R^*(G_{n,\theta})$ be the general EB benchmark in (2.8) with the distribution
$G_n = G_{n,\theta}$ in (2.4). Then, for the classes $\Theta_n^*(M)$ in (4.14),

(4.15)  $\lim_{(n,M)\to(\infty,\infty)}\ \sup_{\theta\in\Theta_n^*(M)} \{E_{n,\theta}L_n(t^*_{\hat G_n}(X),\theta)/R^*(G_{n,\theta})\} \le 1.$

Remark 8. Since the minimum of $\varepsilon(n,G_{n,\theta},p)$ is taken in (4.14) over
$p \ge 2/\log n$ for each $\theta$, the adaptive ratio optimality (4.15) allows smaller
$R^*(G_{n,\theta})$ than simply using $\varepsilon(n,G_{n,\theta},\infty)$ does as in Theorem 1. Thus, Theorem 6
implies Theorem 1.
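The step from the oracle inequality (4.8) to the ratio optimality (4.15) is short enough to spell out. The derivation below is a sketch under the assumptions that $R^*(G_{n,\theta}) > 0$ and that $M_0$ is the universal constant of Theorem 5; it uses only (4.8) and the definition (4.14).

```latex
% Fix \theta \in \Theta_n^*(M). For every p \ge 2/\log n, (4.8) gives
\sqrt{E_{n,\theta}L_n(t^*_{\hat G_n}(X),\theta)}
  \le \sqrt{R^*(G_{n,\theta})} + M_0\,\varepsilon(n,G_{n,\theta},p)\,(\log n)^{3/2}.
% Taking the infimum over p \ge 2/\log n and using (4.14), that is,
%   (\log n)^{3}\inf_{p}\varepsilon^2(n,G_{n,\theta},p) \le R^*(G_{n,\theta})/M,
\sqrt{E_{n,\theta}L_n(t^*_{\hat G_n}(X),\theta)}
  \le \sqrt{R^*(G_{n,\theta})}\,\bigl(1 + M_0 M^{-1/2}\bigr),
\qquad\text{so that}\qquad
\frac{E_{n,\theta}L_n(t^*_{\hat G_n}(X),\theta)}{R^*(G_{n,\theta})}
  \le \bigl(1 + M_0 M^{-1/2}\bigr)^2 \longrightarrow 1 \quad (M\to\infty).
```

The bound is uniform in $\theta \in \Theta_n^*(M)$, which is what the supremum in (4.15) requires.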

Another main consequence of the oracle inequality in Theorem 5 is the
adaptive minimaxity (2.28) of the GMLEB for a broad range of sequences
$\Theta_n \subset \mathbb{R}^n$. We have stated our results for regular $\ell_p$ balls in Theorem 2. In
the rest of the section, we consider weak $\ell_p$ balls

(4.16)  $\Theta^w_{p,C,n} = \{\theta \in \mathbb{R}^n : \mu_p^w(G_{n,\theta}) \le C\},$

where $G_{n,\theta}$ is the empirical distribution of the components of $\theta$ and the
functional $\mu_p^w(G)$ is the weak moment in (4.1). Alternatively,

Theorem 7. Let $X \sim N(\theta, I_n)$ under $P_{n,\theta}$ with a deterministic $\theta \in \mathbb{R}^n$.
Let $L_n(\cdot,\cdot)$ be the average squared loss (2.2) and $\mathscr{R}_n(\Theta)$ be the minimax
risk (2.27). Then, for all approximate solutions $\hat G_n$ satisfying (2.14), the
GMLEB $\hat\theta = t^*_{\hat G_n}(X)$ is adaptive minimax (2.28) in the weak $\ell_p$ balls
$\Theta_n = \Theta^w_{p,C_n,n}$ in (4.16), provided that the radii $C_n$ are within the range (2.30).

Here is our argument. The weak $L_p$ ball that matches (4.16) is
$\mathscr{G}^w_{p,C} = \{G : \mu_p^w(G) \le C\}$.

Let $J^w_{p,C}(\lambda) = -\int_0^\infty (t^2 \wedge \lambda^2)\,d\{1 \wedge (C/t)^p\}$, which is approximately the Bayes
risk of the soft-threshold estimator for the stochastically largest Pareto prior
in $\mathscr{G}^w_{p,C}$. Let $\lambda_{p,C} = \sqrt{1 \vee \{2\log(1/C^{p\wedge 2})\}}$. Johnstone [23] proved that

(4.17) km ^ 9 ^J-i 



w 
P,C n 



for p > 2 with C n -> C+ > and for p < 2 with nC£/(logn) 1+6 /P -> oo, 
and that M(y™ Cn ) / J™ Cn (\ p>Cn ) 1 as CP A2 0. Abramovich et al. [1] 
proved ^(e^ c J n )/J^„(A Pl cJ - 1 for p < 2 and (logn) 5 /n < < 
for all k > 0. The combination of their results implies (4.17) for p < 2 and 
CP > {\ognf/n. Therefore, (4.17) holds under (2.30) due to pk 1 (p)=p/2 + 
4 + 3/p > 5 for p < 2. As in Section 2.7, 



(4.18) sup r n: e{t n ) = o(l)JJ p ,c n 

as in (2.37), due to J p>Cn x 0(? p ,c n ) < ^(^cj- 

5. More simulation results. In addition to the simulation results reported
in Section 2.4, we conducted more experiments to explore a larger
sample size, sparse unknown means without exact zero, and i.i.d. unknown
means from normal priors. The results for the nine statistical procedures
and the oracle rule $t^*_{G_n}(X)$ for the general EB are reported in Tables 2-4,
in the same format as Table 1. Each entry is based on an average of 100
replications. In each column, boldface entries indicate top three performers
other than the hybrid estimator or the oracle. Two columns with $\mu = 4$ are
dropped to fit the tables in.

In Table 2 we report simulation results for $n = 4000$. Compared with
Table 1, F-GEB replaces EBThresh as a distant third top performer in the
moderately sparse case of $\#\{i : \theta_i = \mu\} = 200$, and almost the same sets of
estimators prevail as top performers in other columns. Since the collections
of $G_n$ are identical in Tables 1 and 2, the average squared loss $\|\hat\theta - \theta\|^2/n$
should decrease in $n$ to indicate convergence to the oracle risks for each
estimator in each model, but this is not the case in entries in italics.

Table 2
Average of $\|\hat\theta - \theta\|^2$: $n = 4000$, $\#\{i : \theta_i = \mu\}$ = 20, 200 or 2000

# nonzero         20     20     20     20    200    200    200   2000   2000   2000
mu                 3      4      5      7      3      5      7      3      5      7
James-Stein      175    298    446    790   1243   2229   2846   3261   3689   3829
EBThresh         145    120     63    577    861    404    290   3411   3118   2621
SURE             174    270    329    355   1725    827    827   3296   3317   3317
FDR (0.01)       175    202    103     26   1569    506    231  10230   2607   2090
FDR (0.1)        161    138     70     48   1121    450    409   4578   2597   2563
GMLEB            141    115     68     30    624    215     43   1808    489     62
S-GMLEB          116     92     45     10    597    193     23   1791    479     53
F-GEB            243    231    166    156    739    353    229   1907    641    253
HF-GEB           145    120     63    377    694    286    159   1868    576    171
Oracle           110     84     40      3    587    186     16   1771    460     36

In Table 3, we report simulation results for sparse mean vectors without
exact zero. It turns out that adding uniform $[-0.2, 0.2]$ perturbations to $\theta_i$
does not change the results much, compared with Table 1.

In Table 4, we report simulation results for i.i.d. $\theta_i \sim N(\mu,\sigma^2)$. This is
the parametric model in which the (oracle) Bayes estimators are linear.
Indeed, the James-Stein estimator is the top performer throughout all the
columns and tracks the oracle risk extremely well, while the GMLEB is not
so far behind. It is interesting that for $\sigma^2 = 40$, the EBThresh and SURE
outperform GMLEB as they approximate the naive $\hat\theta = X$ with diminishing
threshold levels. Another interesting phenomenon is the disappearance of
the advantage of the S-GMLEB over the GMLEB, as the unknowns are no
longer sparse.

Table 3
Average of $\|\hat\theta - \theta\|^2$: $n = 1000$, $\theta_i = \mu_i + \mathrm{unif}[-0.2, 0.2]$, $\mu_i \in \{0,\mu\}$, $\#\{i : \mu_i = \mu\}$ = 5, 50 or 500

#{i: mu_i != 0}    5      5      5      5     50     50     50    500    500    500
mu                 3      4      5      7      3      5      7      3      5      7
James-Stein       57     87    124    207    316    559    713    817    932    971
EBThresh          48     44     31     23    226    115     87    855    797    677
SURE              55     75     84     89    426    221    220    830    845    848
FDR (0.01)        56     62     37     20    395    137     72   2555    676    541
FDR (0.1)         53     49     34     27    289    130    116   1152    666    664
GMLEB             49     45     32     23    170     70     27    466    146     32
S-GMLEB           45     41     29     19    164     67     24    462    145     31
F-GEB            115    108    105     91    238    155    118    534    244    145
HF-GEB            48     44     31     23    210    113     85    509    203    101
Oracle            39     35     23     14    158     61     18    454    135     22

Table 4
Average of $\|\hat\theta - \theta\|^2$: $n = 1000$, i.i.d. $\theta_i \sim N(\mu,\sigma^2)$

sigma^2          0.1    0.1    0.1    0.1      2      2      2     40     40     40
mu                 3      4      5      7      3      5      7      3      5      7
James-Stein       92     92     92     93    665    670    665    970    982    975
EBThresh        1081   1058   1035   1020   1013   1032   1014    983    998    997
SURE            1006   1505   3622  13146    988   1033   3514    983    998    996
FDR (0.01)      3972   2049   1169    999   2789   1599   1050   1661   1566   1427
FDR (0.1)       1555   1093   1002    998   1455   1096    999   1184   1161   1117
GMLEB             94     94     95     95    675    678    673   1001   1015   1009
S-GMLEB           97     98     99     98    678    681    675   1002   1015   1009
F-GEB            171    171    175    171    735    743    736   1107   1130   1122
HF-GEB           138    139    143    142    721    726    720   1067   1088   1079
Oracle            91     90     91     90    665    669    664    970    981    975
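The Oracle row of Table 4 can be cross-checked against the closed-form Bayes risk of the normal prior: with unit noise variance, the posterior variance of each $\theta_i$ is $\sigma^2/(1+\sigma^2)$, so the total oracle risk is $n\sigma^2/(1+\sigma^2)$, roughly 90.9, 666.7 and 975.6 for $\sigma^2 = 0.1, 2, 40$ and $n = 1000$. The sketch below also simulates a James-Stein estimator; the paper does not spell out its exact variant in this section, so the positive-part version shrinking toward the sample mean used here, and all function names, are assumptions made for illustration.

```python
import numpy as np

def oracle_bayes_risk(n, sigma2):
    """Total Bayes risk n * sigma^2 / (1 + sigma^2) for theta_i ~ N(mu, sigma^2), unit noise."""
    return n * sigma2 / (1.0 + sigma2)

def james_stein_to_mean(x):
    """Positive-part James-Stein estimator shrinking toward the sample mean (assumed variant)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    s = np.sum((x - xbar) ** 2)
    shrink = max(0.0, 1.0 - (n - 3) / s)
    return xbar + shrink * (x - xbar)

def simulate_js_loss(n=1000, mu=3.0, sigma2=0.1, reps=100, seed=0):
    """Average of ||theta_hat - theta||^2 over `reps` replications, as in the table entries."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(reps):
        theta = rng.normal(mu, np.sqrt(sigma2), size=n)
        x = theta + rng.standard_normal(n)
        losses.append(np.sum((james_stein_to_mean(x) - theta) ** 2))
    return float(np.mean(losses))
```

Because this estimator shrinks toward the sample mean, its simulated loss should not depend much on $\mu$, which is consistent with the nearly constant James-Stein entries within each $\sigma^2$ block of Table 4.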

6. Discussion. In this section, we discuss general EB with kernel esti- 
mates of the oracle Bayes rule, sure computation of an approximate gener- 
alized MLE and a number of additional issues. 



6.1. Kernel methods. General EB estimators of the mean vector $\theta$ can
be directly derived from the formula (3.2) using the kernel method

(6.1)  $\hat\theta = \hat t_n(X), \qquad \hat t_n(x) = x + \dfrac{\hat f_n'(x)}{\hat f_n(x) \vee \rho_n}, \qquad \hat f_n(x) = \sum_{i=1}^n \dfrac{a_n K(a_n(x - X_i))}{n}.$

This was done in [36] with the Fourier kernel $K(x) = (\sin x)/(\pi x)$ and
$\sqrt{2\log n} \le a_n \asymp \sqrt{\log n}$. The main rationale for using the Fourier kernel is the
near optimal convergence rate of $\hat f_n - f_{G_n} = O(\sqrt{(\log n)/n})$ and $\hat f_n' - f_{G_n}' =
O((\log n)/\sqrt{n})$, uniformly in $\theta$. However, since the relationship between
$\hat f_n'(x)$ and $\hat f_n(x)$ is not as tractable as in the case of the generalized MLE $f_{\hat G_n}$, a
much higher regularization level $\rho_n \asymp \sqrt{(\log n)/n}$ in (6.1) was used [36, 38]
to justify the theoretical results. This could be an explanation for the poor
performance of the Fourier general EB estimator for very sparse $\theta$ in our
simulations. From this point of view, the GMLEB is much more appealing
since its estimating function retains all analytic properties of the Bayes rule.
Consequently, the GMLEB requires no regularization for the adaptive ratio
optimality and adaptive minimaxity in our theorems.

Brown and Greenshtein [6] have studied (6.1) with the normal kernel
$K(x) = \varphi(x)$ and possibly different bandwidth $1/a_n$, and have proved the
adaptive ratio optimality (2.21) of their estimator when $\|\theta\|_\infty$ and $R^*(G_{n,\theta})$
have certain different polynomial orders. The estimating function $\hat t_n(x)$ with
the normal kernel, compared with the Fourier kernel, behaves more like the
regularized Bayes rule (3.3) analytically with the positivity of $\hat f_n(x)$ and
more tractable relationship between $\hat f_n'(x)$ and $\hat f_n(x)$. Still,
without some basic properties of the Bayes rule in Proposition 1 and Theorem 3,
it is unclear if the kernel methods of the form (6.1) would possess as
strong theoretical properties as in Theorems 1, 2, 5, 6 and 7 or perform as
well as the GMLEB for moderate samples in simulations [6].

6.2. Sure computation of an approximate generalized MLE. We present a
conservative data-driven criterion to guarantee (2.14) with the EM-algorithm.
This provides a definitive way of computing the map from $\{X_i\}$ to $\hat G_n$ in (2.14)
and then to the GMLEB via (2.18).

Set $u_1 = \min_{1\le i\le n} X_i$, $u_m = \max_{1\le i\le n} X_i$, and

(6.2)  $\varepsilon = (u_m - u_1)/(m-1), \qquad u_j = u_{j-1} + \varepsilon.$

Proposition 5. Suppose $\varepsilon^2\{(u_m - u_1)^2/4 + 1/8\} \le 1/n$ with a sufficiently
large $m$ in (6.2). Let $\hat w_j^{(0)} > 0$ $\forall j \le m$ with $\sum_{j=1}^m \hat w_j^{(0)} = 1$. Suppose
that the EM-algorithm (2.19) is stopped at or beyond an iteration $k > 0$ with

(6.3)  $\max_{1\le j\le m} \log\big(\hat w_j^{(k)}/\hat w_j^{(k-1)}\big) \le \dfrac{1}{n}\log\Big(\dfrac{1}{e q_n}\Big).$

Then, (2.14) holds for $\hat G_n = \sum_{j=1}^m \hat w_j^{(k)}\delta_{u_j}$.

Heuristically, smaller $m$ provides larger $\min_j \hat w_j^{(k)}$ and faster convergence
of the EM-algorithm, so that the "best choice" of $m$ is

$m - 2 < (u_m - u_1)\sqrt{n\{(u_m - u_1)^2/4 + 1/8\}} \le m - 1.$

For $\max_i |X_i| \asymp \sqrt{\log n}$, this ensures the first condition of Proposition 5 with
$m \asymp (\log n)\sqrt{n}$ and $\varepsilon \asymp (n\log n)^{-1/2}$. Proposition 5 is proved via the smoothness
of the normal density and Cover's upper bound [10, 35] for the maximum
likelihood in finite mixture models.
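The recipe of this subsection can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes that the EM-algorithm (2.19) is the standard fixed-grid update $\hat w_j^{(k)} = \hat w_j^{(k-1)} n^{-1}\sum_i \varphi(X_i - u_j)/f^{(k-1)}(X_i)$, that (2.18) is the posterior mean under $\hat G_n$, and that $q_n = (e\sqrt{2\pi}/n^2) \wedge 1$ as in Remark 5. The names `npmle_em`, `gmleb` and the `max_iter` safeguard are introduced here.

```python
import numpy as np

def npmle_em(x, q_n=None, max_iter=5000):
    """Fixed-grid EM for the mixing weights, stopped by the criterion (6.3).

    Returns (grid, weights, converged). The grid follows (6.2) with the
    "best choice" of m; the EM update is the assumed form of (2.19).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if q_n is None:
        q_n = min(np.e * np.sqrt(2.0 * np.pi) / n**2, 1.0)   # assumed default for q_n
    u1, um = x.min(), x.max()
    span = um - u1
    # smallest m with (m - 1) >= span * sqrt(n * (span^2 / 4 + 1/8))
    m = max(int(np.ceil(span * np.sqrt(n * (span**2 / 4.0 + 1.0 / 8.0)))) + 1, 2)
    grid = np.linspace(u1, um, m)
    kern = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2.0 * np.pi)  # n x m
    w = np.full(m, 1.0 / m)                  # w^(0) > 0, summing to one
    tol = np.log(1.0 / (np.e * q_n)) / n     # right-hand side of (6.3)
    converged = False
    for _ in range(max_iter):
        f = kern @ w                                  # mixture density at the data points
        ratio = (kern / f[:, None]).mean(axis=0)      # w^(k) / w^(k-1)
        w = w * ratio                                 # EM update; weights still sum to one
        if np.log(ratio.max()) <= tol:                # stopping rule (6.3)
            converged = True
            break
    return grid, w, converged

def gmleb(x, **kwargs):
    """Posterior mean under the fitted discrete prior (assumed form of (2.18))."""
    x = np.asarray(x, dtype=float)
    grid, w, _ = npmle_em(x, **kwargs)
    kern = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2)
    return (kern @ (w * grid)) / (kern @ w)
```

The kernel matrix has $n \times m$ entries with $m$ of order $(\log n)\sqrt{n}$, so memory grows like $n^{3/2}\log n$. The `converged` flag reports whether (6.3) was met within `max_iter` iterations; if $q_n$ is so large that the right-hand side of (6.3) is not positive, the rule cannot be met, since the largest weight ratio is always at least one.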
6.3. Additional remarks. A crucial element for the theoretical results in
this paper is the oracle inequality for the regularized Bayes estimator with
misspecified prior, as stated in Theorem 3. However, we do not believe that
mathematical induction is sharp in the argument with higher and higher
order of differentiation in the proof of Lemma 1. Consequently, the power $\kappa_1$
in Theorems 2 and 7 is larger than its counterpart more directly established
for threshold estimators [1, 24]. Still, the GMLEB performs as well as any
threshold estimators in our simulations for the most sparse mean vectors. As
expected, the gain of the GMLEB is huge against the James-Stein estimator
for sparse means and against threshold estimators for dense means.

It is interesting to observe in Tables 1-3 that the simulated £2 risk for 
the GMLEB sometimes dips well below the benchmark J27=i ®l A 1 = #{2 <■ 
n : 9i 7^ 0} for the oracle hard threshold rule 6% = < 1} [18], while the 

simulated £2 risk for threshold estimators is always above that benchmark. 

An important consequence of our results is the adaptive minimaxity and
other optimality properties of the GMLEB approach to nonparametric regression
under suitable smoothness conditions. For example, applications
of the GMLEB estimator to the observed wavelet coefficients at individual
resolution levels yield adaptive exact minimaxity in all Besov balls as in [38].

The adaptive minimaxity (2.28) in Theorems 2 and 7 is uniform in the
radii $C$ for fixed shape $p$. A minimax theory for (weak) $\ell_p$ balls uniform
in $(p, C)$ can be developed by careful combination and improvement of the
proofs in [12, 23, 38]. Since the oracle inequality (4.8) is uniform in $p$,
uniform adaptive minimaxity in both $p$ and $C$ is in principle attainable for
the GMLEB.

The theoretical results in this paper are all stated for deterministic
$\theta = (\theta_1, \ldots, \theta_n)$. By either mild modifications of the proofs here or
conditioning on the unknowns, analogous versions of our theorems can be
established for the estimation of i.i.d. means $\{\xi_i\}$ in the EB model (2.25).
Other possible directions of extension of the results in this paper are the
cases of $X_i \sim N(\theta_i, \sigma_n^2)$ via scale change, with known $\sigma_n$ or an
independent consistent estimate of $\sigma_n$, and $X_i \sim N(\theta_i, \sigma_i^2)$ with known
$\sigma_i^2$.

APPENDIX

Here we prove Proposition 1, Lemma 1, Proposition 3, Theorems 5, 2
and 7, and then Proposition 5. We need one more lemma for the proof of
Proposition 1. Throughout this appendix, $\lfloor x \rfloor$ denotes the greatest integer
lower bound of $x$, and $\lceil x \rceil$ denotes the smallest integer upper bound of $x$.

Lemma A.1. Let $f_G(x)$ be as in (2.13) and $L(y)$ as in Proposition 1.
Then,

$$\Big(\frac{f_G'(x)}{f_G(x)}\Big)^2 \le \frac{f_G''(x)}{f_G(x)} + 1 \le L^2(f_G(x)) = \log\frac{1}{2\pi f_G^2(x)} \qquad \forall x. \tag{A.1}$$
PROOF. Since $Y|\xi \sim N(\xi, 1)$ and $\xi \sim G$ under $P_G$, by (3.2)

$$\frac{f_G'(x)}{f_G(x)} = E_G[\xi - Y \mid Y = x], \qquad \frac{f_G''(x)}{f_G(x)} + 1 = E_G[(\xi - Y)^2 \mid Y = x].$$

This gives the first inequality of (A.1). The second inequality of (A.1) follows
from Jensen's inequality: for $h(x) = e^{x/2}$,

$$h\Big(\frac{f_G''(x)}{f_G(x)} + 1\Big) \le E_G[h((\xi - Y)^2) \mid Y = x] = \frac{1}{\sqrt{2\pi} f_G(x)}.$$

This completes the proof. □
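
As a numerical sanity check of Lemma A.1 (not part of the proof), the two inequalities in (A.1) can be evaluated on a grid for a discrete prior. The sketch below assumes a hypothetical four-point prior $G$; the atoms, weights, grid and function names are illustrative choices, not taken from the paper.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mixture_terms(x, atoms, weights):
    """Return f_G(x), f_G'(x), f_G''(x) for a discrete prior G."""
    f = sum(w * phi(x - u) for u, w in zip(atoms, weights))
    f1 = sum(w * (u - x) * phi(x - u) for u, w in zip(atoms, weights))
    f2 = sum(w * ((u - x) ** 2 - 1.0) * phi(x - u) for u, w in zip(atoms, weights))
    return f, f1, f2

def lemma_a1_gaps(x, atoms, weights):
    """Slack in the first and second inequality of (A.1); both should be >= 0."""
    f, f1, f2 = mixture_terms(x, atoms, weights)
    left = (f1 / f) ** 2                       # (f'/f)^2
    middle = f2 / f + 1.0                      # f''/f + 1
    right = -math.log(2.0 * math.pi * f * f)   # L^2(f_G(x))
    return middle - left, right - middle

# Hypothetical prior, used only for illustration.
atoms = [-3.0, 0.0, 0.5, 4.0]
weights = [0.1, 0.6, 0.2, 0.1]
grid = [-8.0 + 0.05 * i for i in range(321)]
gaps = [lemma_a1_gaps(x, atoms, weights) for x in grid]
```

The first gap is the conditional variance of $\xi$ given $Y = x$, and the second is the Jensen gap, so both vanish when $G$ is a point mass.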

Proof of Proposition 1. Since $f_G(x) = \int \varphi(x - u)G(du) > 0$, the
value of (3.3) is always between $t_G(x)$ and $x$. By Lemma A.1

$$|x - t^*_G(x; \rho)| \le \frac{f_G(x)}{f_G(x) \vee \rho} L(f_G(x)) \le L(\rho)$$

for $\rho \le (2\pi e)^{-1/2}$, since $L(y)$ is decreasing in $y^2$ and $y^2L^2(y)$ is increasing
in $y^2 \le 1/(2\pi e)$. Similarly, the second line of (3.4) follows from Lemma A.1
and

$$\frac{\partial t^*_G(x; \rho)}{\partial x} = \begin{cases} 1 + f_G''(x)/f_G(x) - \{f_G'(x)/f_G(x)\}^2, & f_G(x) > \rho, \\ 1 + f_G''(x)/\rho, & f_G(x) < \rho. \end{cases}$$

Note that $L(f_G(x)) \le L(\rho)$ for $f_G(x) \ge \rho$, and for $f_G(x) < \rho \le (2\pi e^3)^{-1/2}$

$$0 \le 1 - \frac{f_G(x)}{\rho} \le 1 + \frac{f_G''(x)}{\rho} \le 1 + \frac{f_G(x)}{\rho}\big(L^2(f_G(x)) - 1\big) \le L^2(\rho)$$

due to the monotonicity of $y\{L^2(y) - 1\}$ in $0 \le y \le (2\pi e^3)^{-1/2}$. □
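
The first bound in Proposition 1 can be illustrated numerically. The sketch below assumes the regularized rule in (3.3) has the form $t^*_G(x;\rho) = x + f_G'(x)/(f_G(x) \vee \rho)$, which is the form used in the derivative formula above; the two-point prior, the value of $\rho$ and the function names are hypothetical and chosen only for illustration.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def regularized_rule(x, atoms, weights, rho):
    """Assumed form of (3.3): x + f_G'(x) / max(f_G(x), rho), discrete prior G."""
    f = sum(w * phi(x - u) for u, w in zip(atoms, weights))
    f1 = sum(w * (u - x) * phi(x - u) for u, w in zip(atoms, weights))
    return x + f1 / max(f, rho)

def L(rho):
    """L(rho) = sqrt(-log(2 pi rho^2))."""
    return math.sqrt(-math.log(2.0 * math.pi * rho * rho))

# Hypothetical sparse prior: 90% mass at 0, 10% mass at 5.
atoms = [0.0, 5.0]
weights = [0.9, 0.1]
rho = 0.01  # satisfies rho <= (2 pi e)^(-1/2)
grid = [-10.0 + 0.1 * i for i in range(251)]
worst_shift = max(abs(x - regularized_rule(x, atoms, weights, rho)) for x in grid)
```

Far in the tails, where $f_G(x) < \rho$, the regularized rule stays close to $x$, whereas the unregularized ratio $f_G'(x)/f_G(x)$ grows linearly in $|x|$.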

Proof of Lemma 1. Let $D = d/dx$. We first prove that for all integers
$k \ge 0$ and $a \ge \sqrt{2k-1}$,

$$\int \{D^k(f_G - f_{G_0})\}^2\,dx \le \frac{4a^{2k}}{\sqrt{2\pi}}\,d^2(f_G, f_{G_0}) + \frac{4a^{2k-1}}{\pi}e^{-a^2}. \tag{A.2}$$

Let $h^*(u) = \int e^{iux}h(x)\,dx$ for all integrable $h$. Since $|f_G^*(u)| \le \varphi^*(u) = e^{-u^2/2}$,
it follows from the Plancherel identity that

$$\begin{aligned}
\int \{D^k(f_G - f_{G_0})\}^2\,dx &= \frac{1}{2\pi}\int u^{2k}|f_G^*(u) - f_{G_0}^*(u)|^2\,du \\
&\le \frac{a^{2k}}{2\pi}\int |f_G^*(u) - f_{G_0}^*(u)|^2\,du + \frac{4}{2\pi}\int_{|u|>a} u^{2k}e^{-u^2}\,du \\
&= a^{2k}\int |f_G - f_{G_0}|^2\,dx + \frac{4}{\pi}c_k,
\end{aligned}$$

where $c_k = \int_{u>a} u^{2k}e^{-u^2}\,du$. Since $(k - 1/2) \le a^2/2$, integrating by parts
yields

$$\begin{aligned}
c_k &= 2^{-1}a^{2k-1}e^{-a^2} + \{(k - 1/2)/a^2\}a^2c_{k-1} \\
&\le 2^{-1}a^{2k-1}e^{-a^2}(1 + 1/2 + \cdots + 1/2^{k-1}) + 2^{-k}a^{2k}c_0 \le a^{2k-1}e^{-a^2}
\end{aligned}$$

due to $c_0 \le a^{-1}\int_{u>a} ue^{-u^2}\,du = e^{-a^2}/(2a)$. Since $f_G(x) \le 1/\sqrt{2\pi}$,

$$\int |f_G - f_{G_0}|^2\,dx \le \big\|\sqrt{f_G} + \sqrt{f_{G_0}}\big\|_\infty^2\,d^2(f_G, f_{G_0}) \le \frac{4}{\sqrt{2\pi}}\,d^2(f_G, f_{G_0}).$$

The combination of the above inequalities yields (A.2).
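
The tail-moment recursion used for (A.2) is easy to check numerically. The sketch below evaluates $c_k = \int_{u>a} u^{2k}e^{-u^2}\,du$ through the integration-by-parts recursion, starting from $c_0 = (\sqrt{\pi}/2)\,\mathrm{erfc}(a)$, and compares it with the bound $a^{2k-1}e^{-a^2}$ that the geometric sum and the bound on $c_0$ give when $a^2 \ge 2k - 1$. The function names and test values are illustrative assumptions.

```python
import math

def tail_moment(k, a):
    """c_k = int_a^inf u^(2k) exp(-u^2) du via c_j = a^(2j-1) e^(-a^2)/2 + (j - 1/2) c_{j-1}."""
    c = 0.5 * math.sqrt(math.pi) * math.erfc(a)  # c_0
    for j in range(1, k + 1):
        c = 0.5 * a ** (2 * j - 1) * math.exp(-a * a) + (j - 0.5) * c
    return c

def tail_bound(k, a):
    """Upper bound a^(2k-1) exp(-a^2), valid for a^2 >= 2k - 1 and k >= 1."""
    return a ** (2 * k - 1) * math.exp(-a * a)

# Check the bound at the smallest admissible a and at a few larger values.
checks = []
for k in range(1, 7):
    a_min = math.sqrt(2 * k - 1)
    for a in (a_min, a_min + 0.5, a_min + 2.0):
        checks.append(tail_moment(k, a) <= tail_bound(k, a))
```

The condition $a^2 \ge 2k - 1$ is what makes each step of the recursion contract by a factor of at most $1/2$ relative to $a^2c_{k-1}$.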

Define $w_* = 1/(f_G \vee \rho + f_{G_0} \vee \rho)$ and $A_k = (\int \{D^k(f_G - f_{G_0})\}^2 w_*)^{1/2}$.
Integrating by parts, we find

$$A_k^2 = -\int \{D^{k-1}(f_G - f_{G_0})\}\{(D^{k+1}(f_G - f_{G_0}))w_* + (D^k(f_G - f_{G_0}))(Dw_*)\}.$$

Since $|(Dw_*)(x)| \le 2L(\rho)w_*(x)$ by Proposition 1, Cauchy-Schwarz gives

$$A_k^2 \le A_{k-1}A_{k+1} + 2L(\rho)A_{k-1}A_k.$$

Let $k_0$ be a nonnegative integer satisfying $k_0 \le L^2(\rho)/2 < k_0 + 1$. Define
$k^* = \min\{k : A_{k+1} \le k_0 2L(\rho)A_k\}$. For $k < k^*$, we have
$A_k^2 \le (1 + 1/k_0)A_{k-1}A_{k+1}$, so that for $k^* \le k_0$,

$$A_1/A_0 \le (1 + 1/k_0)^{k^*}A_{k^*+1}/A_{k^*} \le (1 + 1/k_0)^{k_0}k_0 2L(\rho) \le eL^3(\rho).$$

Since $(f_G^{1/2} + f_{G_0}^{1/2})^2 w_* \le 2$, we have $A_0^2 \le 2d^2(f_G, f_{G_0})$. Thus, for $k^* \le k_0$

$$A_1 \le eL^3(\rho)\sqrt{2}\,d(f_G, f_{G_0}). \tag{A.3}$$

For $k_0 < k^*$, $A_1/A_0 \le (1 + 1/k_0)^k A_{k+1}/A_k$ for all $k \le k_0$, so that

$$A_1/A_0 \le \Big[\prod_{k=0}^{k_0}\{(1 + 1/k_0)^k A_{k+1}/A_k\}\Big]^{1/(k_0+1)} = (1 + 1/k_0)^{k_0/2}\{A_{k_0+1}/A_0\}^{1/(k_0+1)}. \tag{A.4}$$

To bound $A_{k_0+1}$ by (A.2), we pick the constant $a > 0$ with the $a^2$ in (3.9),
so that $a^2 \ge 2(k_0 + 1/2)$ and $e^{-a^2} \le d^2(f_G, f_{G_0})$. Since $w_* \le 1/(2\rho)$, an
application of (A.2) with this $a$ gives

$$A_{k_0+1}^2 \le \frac{1}{2\rho}\int \{D^{k_0+1}(f_G - f_{G_0})\}^2\,dx \le \frac{2a^{2(k_0+1)}}{\rho\sqrt{2\pi}}\,d^2(f_G, f_{G_0})\big(1 + a^{-1}\sqrt{2/\pi}\big).$$

Since $A_0^2 \le 2d^2(f_G, f_{G_0})$, inserting the above inequality into (A.4) yields

$$\begin{aligned}
A_1 &\le (1 + 1/k_0)^{k_0/2}A_0^{k_0/(k_0+1)}A_{k_0+1}^{1/(k_0+1)} \\
&\le (1 + 1/k_0)^{k_0/2}\sqrt{2}\,d(f_G, f_{G_0})\,a\Big(\frac{1 + a^{-1}\sqrt{2/\pi}}{\rho\sqrt{2\pi}}\Big)^{1/(2k_0+2)} \\
&\le \sqrt{2e}\,d(f_G, f_{G_0})\,a\sqrt{2}\,(2\pi\rho^2)^{-1/(4k_0+4)}.
\end{aligned}\tag{A.5}$$

Since $|\log(2\pi\rho^2)| = L^2(\rho) \le 2k_0 + 2$, (3.9) follows from (A.3) and (A.5). □

Proof of Proposition 3. We provide a dense version of the proof
since it is similar to the entropy calculations in [20, 21, 39].
It follows from (3.3), (3.4) and Lemma A.1 that

$$|t^*_G(x;\rho) - t^*_H(x;\rho)| \le \frac{1}{\rho}|f_G'(x) - f_H'(x)| + \frac{L(\rho)}{\rho}|f_G(x) - f_H(x)|, \tag{A.6}$$

so that we need to control the norm of both $f_G$ and $f_G'$.

Let $a = L(\eta)$, $j^* = \lceil 2M/a + 2 \rceil$ and $k^* = \lfloor 6a^2 \rfloor$. Define semiclosed intervals

$$I_j = (-M + (j-2)a,\ (-M + (j-1)a) \wedge (M + a)], \qquad j = 1, \ldots, j^*,$$

to form a partition of $(-M - a, M + a]$. It follows from Carathéodory's
theorem [9] that for each distribution function $G$ there exists a discrete
distribution function $G_m$ with support $[-M - a, M + a]$ and no more than
$m = (2k^* + 2)j^* + 1$ support points such that

$$\int_{I_j} u^k G(du) = \int_{I_j} u^k G_m(du), \qquad k = 0, 1, \ldots, 2k^* + 1,\ j = 1, \ldots, j^*. \tag{A.7}$$
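Since the Taylor expansion of $e^{-t^2/2}$ has alternating signs, for $t^2/2 \le k^* + 2$

$$0 \le \mathrm{Rem}(t) = (-1)^{k^*+1}\Big\{\varphi(t) - \sum_{k=0}^{k^*}\frac{(-t^2/2)^k}{k!\sqrt{2\pi}}\Big\} \le \frac{(t^2/2)^{k^*+1}}{(k^*+1)!\sqrt{2\pi}}.$$

Thus, since $k^* + 1 \ge 6a^2$, for $x \in I_j \cap [-M, M]$, the Stirling formula yields

$$\begin{aligned}
|f_G'(x) - f_{G_m}'(x)| &\le \Big|\int_{(I_{j-1}\cup I_j\cup I_{j+1})^c}(x - u)\varphi(x - u)\{G(du) - G_m(du)\}\Big| \\
&\quad + \Big|\int_{I_{j-1}\cup I_j\cup I_{j+1}}(x - u)\mathrm{Rem}(x - u)\{G(du) - G_m(du)\}\Big| \\
&\le \max_{t \ge a} t\varphi(t) + \frac{4a\{(2a)^2/2\}^{k^*+1}}{\sqrt{2\pi}(k^*+1)!} \le a\eta + \frac{4a(e/3)^{k^*+1}}{2\pi(k^*+1)^{1/2}}
\end{aligned}\tag{A.8}$$

due to $a \ge 1$. Similarly, for $|x| \le M$

$$|f_G(x) - f_{G_m}(x)| \le \eta + \frac{(e/3)^{k^*+1}}{2\pi(k^*+1)^{1/2}}. \tag{A.9}$$

Furthermore, since $(e/3)^6 \le e^{-1/2}$ and $k^* + 1 \ge 6a^2 \ge 6$, we have
$(e/3)^{k^*+1} \le e^{-a^2/2} = \sqrt{2\pi}\,\eta$, so that by (A.6), (A.8) and (A.9)

$$\|t^*_G(\cdot;\rho) - t^*_{G_m}(\cdot;\rho)\|_{\infty,M} \le \rho^{-1}\eta\big(2L(\eta) + 5/\sqrt{12\pi}\big). \tag{A.10}$$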

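Let $\xi \sim G_m$, $\xi_\eta = \eta\,\mathrm{sgn}(\xi)\lfloor |\xi|/\eta \rfloor$ and $G_{m,\eta} \sim \xi_\eta$. Since $|\xi - \xi_\eta| \le \eta$,

$$\|f_{G_m} - f_{G_{m,\eta}}\|_\infty \le C_1^*\eta, \qquad \|f_{G_m}' - f_{G_{m,\eta}}'\|_\infty \le C_2^*\eta,$$

where $C_1^* = \sup_x |\varphi'(x)| = (2e\pi)^{-1/2}$ and $C_2^* = \sup_x |\varphi''(x)| = \sqrt{2/\pi}\,e^{-3/2}$.
This and (A.6) imply

$$\|t^*_{G_m}(\cdot;\rho) - t^*_{G_{m,\eta}}(\cdot;\rho)\|_\infty \le (\eta/\rho)\{C_2^* + C_1^*L(\rho)\}. \tag{A.11}$$

Moreover, $G_{m,\eta}$ has at most $m$ support points.

Let $\mathscr{P}^m$ be the set of all vectors $w = (w_1, \ldots, w_m)$ satisfying $w_j \ge 0$ and
$\sum_{j=1}^m w_j = 1$. Let $\mathscr{P}^{m,\eta}$ be an $\eta$-net of $N(\eta, \mathscr{P}^m, \|\cdot\|_1)$ elements in $\mathscr{P}^m$:

$$\inf_{w^{m,\eta} \in \mathscr{P}^{m,\eta}} \|w - w^{m,\eta}\|_1 \le \eta \qquad \forall w \in \mathscr{P}^m.$$

Let $\{u_j, j = 1, \ldots, m\}$ be the support of $G_{m,\eta}$ and $w^{m,\eta}$ be a vector in $\mathscr{P}^{m,\eta}$
with $\sum_{j=1}^m |G_{m,\eta}(\{u_j\}) - w_j^{m,\eta}| \le \eta$. Set $\tilde G_{m,\eta} = \sum_{j=1}^m w_j^{m,\eta}\delta_{u_j}$. Then,

$$\|f_{G_{m,\eta}} - f_{\tilde G_{m,\eta}}\|_\infty \le C_0^*\eta, \qquad \|f_{G_{m,\eta}}' - f_{\tilde G_{m,\eta}}'\|_\infty \le C_1^*\eta,$$

where $C_0^* = \varphi(0) = 1/\sqrt{2\pi}$. This and (A.6) imply

$$\|t^*_{G_{m,\eta}}(\cdot;\rho) - t^*_{\tilde G_{m,\eta}}(\cdot;\rho)\|_\infty \le (\eta/\rho)\{C_1^* + C_0^*L(\rho)\}. \tag{A.12}$$

The support of $G_{m,\eta}$ and $\tilde G_{m,\eta}$ is $\Omega_{\eta,M} = \{0, \pm\eta, \pm 2\eta, \ldots\} \cap [-M - a, M + a]$.
Summing (A.10), (A.11) and (A.12) together, we find

$$\begin{aligned}
\|t^*_G(\cdot;\rho) - t^*_{\tilde G_{m,\eta}}(\cdot;\rho)\|_{\infty,M} &\le (\eta/\rho)\big[\{2 + C_1^* + C_0^*\}L(\eta) + 5/\sqrt{12\pi} + C_2^* + C_1^*\big] \\
&\le (\eta/\rho)\{2.65L(\eta) + 1.24\} \le \eta^*.
\end{aligned}$$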
Counting the number of ways to realize $\{u_j\}$ and $w^{m,\eta}$, we find

$$N(\eta^*, \mathscr{T}_\rho, \|\cdot\|_{\infty,M}) \le \binom{|\Omega_{\eta,M}|}{m} N(\eta, \mathscr{P}^m, \|\cdot\|_1), \tag{A.13}$$

with $m = (2k^* + 2)j^* + 1$, $|\Omega_{\eta,M}| = 1 + 2\lfloor (M + a)/\eta \rfloor$, $a = L(\eta)$,
$j^* = \lceil 2M/a + 2 \rceil$ and $k^* = \lfloor 6a^2 \rfloor$.

Since $\mathscr{P}^m$ is in the $\ell_1$ unit-sphere of $\mathbb{R}^m$, $N(\eta, \mathscr{P}^m, \|\cdot\|_1)$ is no greater
than the maximum number of disjoint $\mathrm{Ball}(v_j, \eta/2, \|\cdot\|_1)$ with centers $v_j$ in
the unit sphere. Since all these balls are inside the $(1 + \eta/2)$ $\ell_1$-ball, volume
comparison yields $N(\eta, \mathscr{P}^m, \|\cdot\|_1) \le (2/\eta + 1)^m$. With another application
of the Stirling formula, this and (A.13) yield

$$\begin{aligned}
N(\eta^*, \mathscr{T}_\rho, \|\cdot\|_{\infty,M}) &\le (2/\eta + 1)^m|\Omega_{\eta,M}|^m/m! \\
&\le \{(1 + 2/\eta)(1 + 2(M + a)/\eta)\}^m\{(m + 1)^{m+1/2}e^{-m-1}\sqrt{2\pi}\}^{-1} \\
&\le [(\eta + 2)(\eta + 2(M + a))e/(m + 1)]^m\eta^{-2m}e\{2\pi(m + 1)\}^{-1/2}.
\end{aligned}\tag{A.14}$$

Since $m - 1 \ge 12a^2(2M/a + 2) = 24a(M + a)$ and $a \ge 1 > 1/2 \ge \eta$,

$$(\eta + 2)(\eta + 2(M + a))e \le 8\{1/2 + 2(M + a)\} \le m + 1.$$

Hence, (A.14) is bounded by $\eta^{-2m}$ with $m \le 2(6a^2 + 1)(2M/a + 3) + 1$. □
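
The discretization step in the proof of Proposition 3, which rounds each atom toward zero onto the grid $\{0, \pm\eta, \pm 2\eta, \ldots\}$, can be illustrated numerically. The sketch below checks that the mixture density moves by at most $\eta \sup_x|\varphi'(x)| = \eta(2e\pi)^{-1/2}$ in sup norm over a finite grid; the prior, the value of $\eta$ and the function names are hypothetical and only serve the illustration.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def round_toward_zero(u, eta):
    """xi_eta = eta * sgn(xi) * floor(|xi| / eta)."""
    return math.copysign(eta * math.floor(abs(u) / eta), u)

def mixture_density(x, atoms, weights):
    """f_G(x) for a discrete prior G."""
    return sum(w * phi(x - u) for u, w in zip(atoms, weights))

# Hypothetical prior and grid width, for illustration only.
eta = 0.05
atoms = [-2.37, -0.41, 0.0, 1.234, 3.999]
weights = [0.15, 0.25, 0.3, 0.2, 0.1]
rounded = [round_toward_zero(u, eta) for u in atoms]

grid = [-8.0 + 0.01 * i for i in range(1601)]
sup_gap = max(
    abs(mixture_density(x, atoms, weights) - mixture_density(x, rounded, weights))
    for x in grid
)
c1_star = 1.0 / math.sqrt(2.0 * math.e * math.pi)  # sup |phi'(x)|
```

The bound is a direct consequence of the mean value theorem applied to $\varphi$, since each atom moves by at most $\eta$.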

Proof of Theorem 5. Throughout the proof, we use $M_0$ to denote a
universal constant which may take different values from one appearance to
another. For simplicity, we take $q_n = (e\sqrt{2\pi}/n^2) \wedge 1$ in (2.14) so that (4.4)
holds with $\rho_n = n^{-3}$.

Let $\varepsilon_n$ and $x_*$ be as in Theorem 4 and $L(\rho) = \sqrt{-\log(2\pi\rho^2)}$ be as in
Propositions 1 and 4. With $\rho_n = n^{-3}$, set

$$\eta = \frac{\rho_n}{n} = n^{-4}, \qquad \eta^* = \frac{\eta}{\rho_n}\{3L(\eta) + 2\}, \qquad M = \frac{2n\varepsilon_n^2}{(\log n)^{3/2}}. \tag{A.15}$$

Let $x^* = \max(x_*, 1)$ and $\{t^*_{H_j}(\cdot;\rho_n), j \le N\}$ be a $(2\eta^*)$-net of

$$\mathscr{T}_{\rho_n} \cap \{t^*_G : d(f_G, f_{G_n}) \le x^*\varepsilon_n\} \tag{A.16}$$

under the $\|\cdot\|_{\infty,M}$ seminorm as in Proposition 3, with distributions $H_j$
satisfying $d(f_{H_j}, f_{G_n}) \le x^*\varepsilon_n$ and $N = N(\eta^*, \mathscr{T}_{\rho_n}, \|\cdot\|_{\infty,M})$. It is a $(2\eta^*)$-net
due to the additional requirements on $H_j$. Since $M \ge 4\sqrt{\log n}$ and $\eta = 1/n^4$
by (4.2) and (A.15), Proposition 3 and (A.15) give

$$\log N \le M_0(\log n)^{3/2}M/2 \le M_0 n\varepsilon_n^2. \tag{A.17}$$

We divide the $\ell_2$ distance of the error into five parts:

$$\|t^*_{\hat G_n}(X;\rho_n) - \theta\| \le \sqrt{nR^*(G_n)} + \sum_{j=1}^4 |\zeta_{jn}|,$$
where $\zeta_{jn}$ are as in (4.9), (4.10), (4.11) and (4.12). As we have mentioned
in the outline, the problem is to bound $E_{n,\theta}\zeta_{jn}^2$ in view of (4.13).

Let $A_n$ and $\zeta_{1n}$ be as in (4.9). Since $x^* = 1 \vee x_* \ge 1$ and $n\varepsilon_n^2 \ge 2(\log n)^2$
by (4.2), Theorem 4 gives

$$P_{n,\theta}\{A_n^c\} \le \exp\big(-(x^*)^2 n\varepsilon_n^2/(2\log n)\big) \le 1/n.$$

Thus, since $L^2(\rho_n) = -\log(2\pi/n^6)$ with $\rho_n = n^{-3}$, Proposition 1 gives

$$\begin{aligned}
E_{n,\theta}\zeta_{1n}^2 &= E_{n,\theta}\sum_{i=1}^n \{(t^*_{\hat G_n}(X_i;\rho_n) - X_i) + (X_i - \theta_i)\}^2 I_{A_n^c} \\
&\le 2nL^2(\rho_n)P_{n,\theta}\{A_n^c\} + 2E_{n,\theta}\sum_{i=1}^n (X_i - \theta_i)^2 I_{A_n^c} \\
&\le M_0\log n + 2n\int_0^\infty \min\big(P\{|N(0,1)| > x\}, 1/n\big)\,dx^2.
\end{aligned}$$

Since $P\{N(0,1) > x\} \le e^{-x^2/2}$ and $\int_0^\infty \min(ne^{-x^2/2}, 1)\,dx^2/2 = 1 + \log n$,

$$E_{n,\theta}\zeta_{1n}^2 \le M_0\log n \le M_0 n\varepsilon_n^2. \tag{A.18}$$
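
The integral identity used for (A.18) can be confirmed numerically. With the substitution $s = x^2/2$, the claim reads $\int_0^\infty \min(ne^{-s}, 1)\,ds = 1 + \log n$; the sketch below approximates the left side with a midpoint rule. The truncation point, step count and function name are arbitrary illustrative choices.

```python
import math

def truncated_tail_integral(n, upper=60.0, steps=200000):
    """Midpoint-rule approximation of int_0^upper min(n * exp(-s), 1) ds."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += min(n * math.exp(-s), 1.0)
    return total * h

n = 1000
numeric = truncated_tail_integral(n)
exact = 1.0 + math.log(n)  # log n from the flat part, 1 from the exponential tail
```

The integrand equals 1 on $[0, \log n]$ and $ne^{-s}$ afterwards, which is where the two terms $\log n$ and $1$ come from.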

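Consider $\zeta_{2n}^2$. Since $t^*_{H_j}(\cdot;\rho_n)$ form a $(2\eta^*)$-net of (A.16) under $\|\cdot\|_{\infty,M}$
and $|t^*_G(x;\rho) - x| \le L(\rho)$ by Proposition 1, it follows from (4.10) that

$$\begin{aligned}
\zeta_{2n}^2 &\le \min_{j \le N}\|t^*_{\hat G_n}(X;\rho_n) - t^*_{H_j}(X;\rho_n)\|^2 I_{A_n} \\
&\le (2\eta^*)^2\#\{i : |X_i| \le M\} + \{2L(\rho_n)\}^2\#\{i : |X_i| > M\}.
\end{aligned}$$

By (4.2), $(n\varepsilon_n^2/\log n)^{p+1} \ge n\{\sqrt{\log n}\,\mu_p^w(G_n)\}^p$, so that by (4.1) and (A.15)

$$\int_{|u| \ge M/2} G_n(du) \le \Big(\frac{\mu_p^w(G_n)}{M/2}\Big)^p \le \Big(\frac{2n\varepsilon_n^2}{M(\log n)^{3/2}}\Big)^p\frac{\varepsilon_n^2}{\log n} = \frac{\varepsilon_n^2}{\log n}. \tag{A.19}$$

Thus, since $\eta^* = n^{-1}\{3L(n^{-4}) + 2\}$ and $M \ge 4\sqrt{\log n}$ by (A.15) and (4.2),

$$\begin{aligned}
E_{n,\theta}\zeta_{2n}^2 &\le n(2\eta^*)^2 + 4L^2(n^{-3})E_{n,\theta}\#\{i : |X_i| > M\} \\
&\le M_0(\log n)\,n\Big\{\frac{1}{n^2} + \int_{|u| \ge M/2} G_n(du) + P\{|N(0,1)| > 2\sqrt{\log n}\}\Big\} \\
&\le M_0(\log n)\Big(\frac{1}{n} + \frac{n\varepsilon_n^2}{\log n} + \frac{2}{n}\Big).
\end{aligned}$$

Since $n\varepsilon_n^2 \ge 2(\log n)^2$ by (4.2), we find

$$E_{n,\theta}\zeta_{2n}^2 \le M_0 n\varepsilon_n^2. \tag{A.20}$$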
l Z J\u\>M/2 
Now, consider $\zeta_{3n}^2$. Since $L^2(\rho_n) \le M_0\log n$, it follows from (4.11),
Proposition 4 and (A.17) that

$$\begin{aligned}
E_{n,\theta}\zeta_{3n}^2 &= \int_0^\infty P_{n,\theta}\{\zeta_{3n} > x\}\,dx^2 \\
&\le \int_0^\infty \min\{1, N\exp(-x^2/(2L^4(\rho_n)))\}\,dx^2 \\
&= 2L^4(\rho_n)(1 + \log N) \le M_0(\log n)^2 n\varepsilon_n^2.
\end{aligned}\tag{A.21}$$

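For $\zeta_{4n}^2$, it suffices to apply Theorem 3(ii) with $G_0 = G_n$, $G = H_j$,
$\rho = \rho_n = n^{-3}$, $x = M/2$ and $\varepsilon = x^*\varepsilon_n \ge d(f_{H_j}, f_{G_n})$, since

$$\zeta_{4n}^2 \le n\max_{j \le N}\big[E_{G_n}\{t^*_{H_j}(Y;\rho_n) - \xi\}^2 - R^*(G_n)\big] \tag{A.22}$$

by (4.12) and (2.5). It follows from (A.19) that the $M_1$ in Theorem 3(ii) is
no greater than

$$\frac{\int_{|u| \ge M/2} G_n(du)}{|\log\rho_n|^3(x^*\varepsilon_n)^2} \le \frac{\varepsilon_n^2/\log n}{(\log n)^3\varepsilon_n^2} \le M_0.$$

Since $M = 2n\varepsilon_n^2/(\log n)^{3/2}$ by (A.15) and $n\varepsilon_n^2 \ge 2(\log n)^2$ by (4.2), the $M_2$
in Theorem 3(ii) is no greater than

$$\frac{2(M/2 + 1)\rho_n}{(\log\rho_n)^2(x^*\varepsilon_n)^2} \le \frac{2(n\varepsilon_n^2/(\log n)^{3/2} + 1)/n^3}{(3\log n)^2\varepsilon_n^2} \le \frac{\sqrt{\log n} + 1}{n^2(\log n)^4}$$

with $\rho_n = n^{-3}$. Thus, by Theorem 3(ii) and (A.22)

$$\zeta_{4n}^2 \le M_0 n|(\log\rho_n)/3|^3\varepsilon_n^2 = M_0 n\varepsilon_n^2(\log n)^3. \tag{A.23}$$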

Adding (A.18), (A.20), (A.21) and (A.23) together, we have

$$E_{n,\theta}\Big(\sum_{j=1}^4 |\zeta_{jn}|\Big)^2 \le M_0 n\varepsilon_n^2(\log n)^3.$$

Since $L_n(\hat\theta, \theta) = \|\hat\theta - \theta\|^2/n$, this and (4.13) complete the proof. □

Proof of Theorem 2. As we have mentioned, by (2.34), (2.35) and
(2.36), the adaptive minimaxity (2.28) with $\Theta_n = \Theta_{p,C_n,n}$ follows from (2.37).
By (4.1) and (2.29), $\mu_p^w(G_{n,\theta}) \le C$ for $\theta \in \Theta_{p,C,n}$, so that by (4.2) and
Theorem 5, $\sup_{\theta \in \Theta_{p,C,n}} \tilde r_{n,\theta}(\hat t_n) \le M_0\varepsilon_{p,C,n}(\log n)^{3/2}$ with

$$\varepsilon_{p,C,n}^2 = \max\big[2\log n, \{nC^p(\log n)^{p/2}\}^{1/(1+p)}\big](\log n)/n. \tag{A.24}$$

Thus, it suffices to verify that for sequences $C_n$ satisfying (2.30),

$$\varepsilon_{p,C_n,n}^2(\log n)^3/J_{p,C_n} \to 0, \tag{A.25}$$

where $J_{p,C} = \min\{1, C^{p \wedge 2}\{1 \vee (2\log(1/C^p))\}^{(1-p/2)_+}\}$ as in (2.37).
We consider three cases. For $C_n^{2 \wedge p} > e^{-1/2}$, $J_{p,C_n} \ge e^{-1/2}$ and

$$\varepsilon_{p,C_n,n}^2(\log n)^3 = \max\Big[\frac{2(\log n)^5}{n}, \Big\{\frac{C_n(\log n)^{9/2+4/p}}{n}\Big\}^{p/(1+p)}\Big] \to 0,$$

since $\kappa_2(p) = 9/2 + 4/p$ in (2.30).

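For $p < 2$ and $C_n^p \le e^{-1/2}$, $J_{p,C_n} = C_n^p\{2\log(1/C_n^p)\}^{1-p/2}$, so that by (A.24)

$$\frac{\varepsilon_{p,C_n,n}^2(\log n)^3}{J_{p,C_n}} \le \max\Big[\frac{2(\log n)^5}{nC_n^p\{\log(1/C_n^p)\}^{1-p/2}}, \frac{(\log n)^{4+p/(2+2p)}}{(nC_n^p)^{p/(1+p)}\{\log(1/C_n^p)\}^{1-p/2}}\Big].$$

Since the case $C_n^p \ge n^{-1/2}$ is trivial, it suffices to consider the case $C_n^p < n^{-1/2}$
where

$$\frac{\varepsilon_{p,C_n,n}^2(\log n)^3}{J_{p,C_n}} \le 4\max\Big[\frac{(\log n)^{4+p/2}}{nC_n^p}, \frac{(\log n)^{3+p/2+p/(2+2p)}}{(nC_n^p)^{p/(1+p)}}\Big].$$

Since $4 + p/2 < p\kappa_1(p) = 4 + 3/p + p/2 = (1 + 1/p)\{3 + p/2 + p/(2 + 2p)\}$,
(2.30) still implies (A.25).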

Finally, for $p \ge 2$ and $C_n^2 \le e^{-1/2}$, $J_{p,C_n} = C_n^2$, so that

$$\frac{\varepsilon_{p,C_n,n}^2(\log n)^3}{J_{p,C_n}} = \max\Big[\frac{2(\log n)^5}{nC_n^2}, \Big\{\frac{C_n(\log n)^{9/2+4/p}}{nC_n^{2(1+1/p)}}\Big\}^{p/(1+p)}\Big].$$

Since $nC_n^{1+2/p} = n^{1/2-1/p}(nC_n^2)^{1/2+1/p}$, we need $(\log n)^5/(nC_n^2) \to 0$ for $p > 2$
and $(\log n)^{13/2}/(nC_n^2) \to 0$ for $p = 2$. Again (2.30) implies (A.25). □

Proof of Theorem 7. Since the oracle inequality (4.8) is based on
the weak $\ell_p$ norm, the proof of Theorem 2 also provides (4.18). □



Proof of Proposition 5. Let $\hat G_n^*$ be the exact generalized MLE as
in (2.12). Since $\varphi(x)$ is decreasing in $|x|$, we have $\hat G_n^*([u_1, u_m]) = 1$. Let
$I_j = (u_{j-1}, u_j]$ and $I_j^* = [u_{j-1}, u_j]$ for $j \ge 2$ and $I_1 = I_1^* = \{u_1\}$. Let $H_{m,j}$ be
sub-distributions with support $\{u_1, \ldots, u_m\} \cap I_j^*$ such that

$$H_{m,j}(I_j^*) = \hat G_n^*(I_j), \qquad \int_{I_j^*} u\,H_{m,j}(du) = \int_{I_j} u\,\hat G_n^*(du), \qquad 1 \le j \le m. \tag{A.26}$$

Let $j > 1$ and $x \in [u_1, u_m]$ be fixed. Set $x_j = x - (u_j + u_{j-1})/2$ and
$t = u - (u_j + u_{j-1})/2$ for $u \in I_j^*$. Since $|x_jt| \le (u_m - u_1)\varepsilon/2 \le n^{-1/2} < 1$,

$$-(1 - e^{-t^2/2})e^{x_jt} \le e^{x_jt - t^2/2} - (1 + x_jt) \le x_j^2t^2e^{x_jt - t^2/2}, \tag{A.27}$$

where the second inequality follows from $e^{-t^2/2}(1 - x_jt) \le e^{-x_jt}$. Since
$\varphi(x - u) = \varphi(x_j - t) = \varphi(x_j)\exp(x_jt - t^2/2)$, (A.26) and (A.27) yield

$$\begin{aligned}
&\int_{I_j}\varphi(x - u)\hat G_n^*(du) - \int_{I_j^*}\varphi(x - u)H_{m,j}(du) \\
&\qquad \le \int_{I_j} x_j^2t^2\varphi(x - u)\hat G_n^*(du) + \int_{I_j^*}(e^{t^2/2} - 1)\varphi(x - u)H_{m,j}(du) \\
&\qquad \le (u_m - u_1)^2(\varepsilon/2)^2\int_{I_j}\varphi(x - u)\hat G_n^*(du) + (e^{\varepsilon^2/8} - 1)\int_{I_j^*}\varphi(x - u)H_{m,j}(du).
\end{aligned}$$


Let $H_m = \sum_{j=1}^m H_{m,j}$. Summing the above inequality over $j$, we find
$e^{\varepsilon^2/8}f_{H_m}(x) \ge (1 - \eta)f_{\hat G_n^*}(x)$ with $\eta = \varepsilon^2(u_m - u_1)^2/4 \le 1/n - \varepsilon^2/8$. Thus,

(A.28) T^4^T ^ ^"^(l " v) n > e~ n ^/ 8+ ^ > e~\ 

Let $\mathscr{H}_m$ be the set of all distributions with support $\{u_1, \ldots, u_m\}$ and
$\hat G_n = \sum_{j=1}^m \hat w_j\delta_{u_j}$. The upper bound in [10, 35] and (6.3) provide

A fn(X l ) . ( wf y 1 

SU P ^ t-ttt- < max — 7Tr~T\ I < • 

This and (A.28) imply $\prod_{i=1}^n f_{\hat G_n^*}(X_i) \le q_n^{-1}\prod_{i=1}^n f_{\hat G_n}(X_i)$. □